Open WebUI RAG API Guide: Programmatic Document Search

Q: How do I delete uploaded RAG documents through the API?

To delete a document, send a DELETE request to `/api/v1/files/{file_id}`. This will remove the file from the database and purge the associated vector embeddings from the vector store. It is important to perform this cleanup in automated workflows to prevent your storage and index from becoming cluttered with stale or redundant information.

Open WebUI RAG API allows you to programmatically inject external knowledge into your self-hosted LLM interface using Retrieval Augmented Generation (RAG). By leveraging the API, developers can automate document ingestion, manage knowledge collections, and query specific files without manual uploads in the UI. This guide explores the technical implementation of the Open WebUI RAG API, providing practical examples for authentication, file management, and advanced retrieval settings.

What is the Open WebUI RAG API and how does it work?

The Open WebUI RAG API is a set of RESTful endpoints that integrate with the platform's internal document processing pipeline. When you use the API for RAG, the system performs several key steps: it ingests raw file data, splits the text into manageable chunks, generates vector embeddings using a local or remote embedding model, and stores those vectors in the internal database. When a user or an API call queries a model with RAG enabled, the system performs a similarity search to find the most relevant context from your uploaded files.

Unlike standard chat completions, the RAG API flow requires a distinct identifier for the knowledge source. You can reference individual files by their unique UUIDs or group them into Collections. This modularity is particularly useful for building domain-specific assistants. For instance, if you are hosting Open WebUI for a technical support team, you can programmatically update a 'Manuals' collection every time a new PDF is released, ensuring the LLM always has the latest data without human intervention.

Technically, the RAG process is managed through the /api/v1/files and /api/v1/collections endpoints. The system supports various embedding providers, including Ollama, OpenAI, and HuggingFace models. Behind the scenes, Open WebUI uses a vector store (often ChromaDB or a similar engine depending on the version) to handle the multidimensional math required for semantic search. Understanding this architecture is the first step toward building truly intelligent, context-aware AI applications on your own infrastructure.

How do I authenticate with the Open WebUI API for RAG?

Authentication for the Open WebUI RAG API typically involves a Bearer Token found within your user profile settings. Unlike some applications that use simple API keys, Open WebUI uses JSON Web Tokens (JWT) to manage sessions and permissions. To find your token, navigate to 'Settings' > 'Account' once logged into your instance. There is a field labeled 'API Key' or 'JWT' that you can copy to use in your headers. Every request made to the API must include this token in the Authorization header to verify your identity and ensure you have access to the files you are querying.

When making a request via curl or a Python script, your header should look like this: Authorization: Bearer <your_token_here>. It is important to treat this token as a secret, as it grants full access to your Open WebUI account, including the ability to delete files or view private chats. For developers building integrations, it is recommended to use environment variables to store these tokens rather than hard-coding them into scripts. This is especially critical if you are deploying your scripts to a shared environment or version control system.

If you find that your token is not working, ensure that your instance is not behind a misconfigured reverse proxy that might be stripping the Authorization header. Some Nginx setups require explicit configuration to pass through custom headers. Additionally, check that your user account has the necessary permissions to use API features. In most enterprise or private ChatGPT setups, admins can toggle API access on a per-user basis. Once authenticated, you gain access to the full suite of file and knowledge management tools.

How can I upload documents for RAG via the API?

Uploading documents via the Open WebUI RAG API is a multipart/form-data request sent to the /api/v1/files/ endpoint. This is the programmatic equivalent of dragging and dropping a file into the web interface. When you send a file through this endpoint, Open WebUI immediately begins the 'ingestion' process. This involves extracting text from PDFs, Markdown files, or Word documents and sending that text to the configured embedding engine. The result is a 'File ID' which you will need for subsequent chat queries.

To perform an upload using Python's requests library, you would typically open your file in binary mode and pass it into the files parameter. It is a best practice to also provide a purpose or name if the API version supports it, though for RAG, the system usually defaults to document storage. Once the upload is successful, the API returns a JSON response containing the file's unique ID. You must wait for the status to change to 'processed' before the file is searchable; during the ingestion phase, the vectors are still being generated and indexed.

Handling larger files or batch uploads requires attention to timeout settings. If you are uploading a 500-page manual, the embedding process might take several minutes depending on your CPU or GPU resources. In these cases, it is helpful to monitor the progress or implement a retry mechanism. You can check the status of a specific file by hitting the GET endpoint for that File ID. This programmatic ingestion is the backbone of any automated self-hosted RAG pipeline, allowing you to sync local directories directly to your AI's brain.

How do I create and manage Knowledge Collections with the API?

Knowledge Collections in Open WebUI allow you to group multiple files under a single alias. This is highly efficient for the RAG API because instead of passing ten different File IDs in a chat request, you can simply pass one Collection ID. Managing these via the API involves two steps: creating the collection and then tagging or assigning files to it. Use the /api/v1/collections/ POST endpoint to create a new cluster, providing a name and a description that helps identify the content's purpose.

Once a collection is created, you can add files to it by updating the file's metadata or using the dedicated collection-update endpoints. In the API world, this grouping allows for more dynamic prompting. For example, you might have a 'Legal' collection and a 'Marketing' collection. Depending on which department the user belongs to, your backend logic can automatically switch the collection ID injected into the RAG query. This provides a tailored experience without necessitating multiple separate LLM instances.

Deleting or updating collections is equally straightforward. A DELETE request to the specific Collection ID will remove the grouping, although usually the underlying files remain unless specifically deleted as well. This separation of files and collections is a powerful feature for data hygiene. When comparing various open-source LLM UIs, the ability to manage complex knowledge hierarchies via API is a significant advantage of the Open WebUI ecosystem, making it suitable for both hobbyists and enterprise-scale deployments.

How can I perform a chat completion using specific RAG file IDs?

To actually use the documents you've uploaded in a conversation, you must include them in the /api/chat/completions request. Inside the JSON payload, you typically add a files array or a collections field within the metadata. This tells Open WebUI's backend: 'Before answering this question, look into these specific File IDs for context.' The RAG engine then retrieves the top 'k' chunks (usually 4-5) that most closely match the user's query and prepends them to the system prompt.

A typical RAG-enabled API payload includes the model name, the message history, and the reference IDs. It is important to note that referencing too many files at once can lead to 'lost in the middle' phenomena or context window overflows. Effective API usage involves selecting only the most relevant collections. If you are using a local ollama web interface, ensure your hardware can handle the increased token count that RAG context adds to every request.

Advanced users can also tweak the rag_config through the API to adjust parameters like 'top_k' or 'score_threshold.' The score_threshold is particularly useful; it ensures that the model only sees context that is actually relevant to the query. If the similarity score is too low, the API will ignore the document instead of feeding the LLM 'noise' that might lead to hallucinations. By fine-tuning these settings via the API, you can balance speed and accuracy for your specific use case.

How do I configure global RAG settings via environment variables?

While many settings can be adjusted per-request, the core behavior of the RAG engine is often determined by environment variables set at the server level. If you are running Open WebUI via Docker, you can configure variables like RAG_EMBEDDING_ENGINE (e.g., set to 'ollama' or 'openai') and RAG_TOP_K. One of the most critical variables for API performance is RAG_EMBEDDING_CONCURRENT_REQUESTS. This controls how many document chunks are embedded simultaneously; setting this too high can crash a small server, while setting it too low makes ingestion painfully slow.

Another important configuration is the document chunk size, controlled by RAG_CHUNK_SIZE. Smaller chunks (e.g., 512 tokens) provide more granular retrieval but might lose local context, whereas larger chunks (e.g., 1500 tokens) provide more context but risk filling up the LLM's prompt window too quickly. When building an API-centric application, you should test different chunking strategies to see which yields the highest retrieval accuracy for your specific data types (e.g., code vs. prose).

Finally, ensure that ENV=dev is temporarily enabled if you need to access the Swagger documentation at /docs. This interactive API explorer allows you to test RAG endpoints directly from the browser, which is invaluable for debugging schema issues. Once your environment variables are locked in, the Open WebUI RAG API becomes a stable, predictable foundation for your AI projects. Whether you are automating knowledge bases or building custom chat interfaces, these server-side configurations ensure your API remains responsive under load.

Frequently Asked Questions about Open WebUI RAG

How do I enable the Swagger API documentation for Open WebUI?

To enable Swagger documentation, you need to set the environment variable ENV=dev in your Open WebUI configuration (usually in your docker-compose.yml or export it in your shell). Once restart is complete, you can visit http://your-instance-url/docs to view the interactive API reference. This is essential for discovering the specific parameters required for the Open WebUI RAG API as they evolve between versions.

Can I use the Open WebUI RAG API with local Ollama models?

Yes, Open WebUI is designed to work seamlessly with Ollama. You can configure the system to use Ollama for both the main LLM chat and the embedding generation. In your settings or environment variables, set RAG_EMBEDDING_ENGINE=ollama and specify an embedding model like mxbai-embed-large or nomic-embed-text. This allows for a completely air-gapped RAG API experience without any data leaving your server.

What are the supported file formats for document uploads in the RAG API?

Open WebUI supports a wide range of formats including .pdf, .docx, .txt, .md, .csv, and even .pptx files. When uploaded via the API, the system uses internal parsers to extract text. For technical documents, Markdown (.md) is highly recommended as it preserves structure (like headers and lists) which can help the RAG engine better understand the hierarchy of the information during the chunking phase.

How do I delete uploaded RAG documents through the API?

To delete a document, send a DELETE request to /api/v1/files/{file_id}. This will remove the file from the database and purge the associated vector embeddings from the vector store. It is important to perform this cleanup in automated workflows to prevent your storage and index from becoming cluttered with stale or redundant information.

Is there an OpenAI-compatible endpoint for RAG in Open WebUI?

While Open WebUI provides an OpenAI-compatible API for basic chat completions, the RAG-specific features (like file management and collection injection) are unique to the Open WebUI v1 API. To use RAG features, you generally need to use the native Open WebUI endpoints. However, once a file is processed into the RAG engine, its context can be used in chat sessions that otherwise follow the OpenAI standard request structure.

Conclusion

The Open WebUI RAG API transforms a static LLM into a dynamic knowledge hub capable of processing and querying vast amounts of private data. By mastering authentication, programmatic file ingestion, and collection management, you can build sophisticated AI workflows that remain entirely under your control. Whether you are looking to automate your documentation search or scale an internal AI assistant, the API provides the flexibility needed for professional-grade implementations. For the best performance and security, consider hosting Open WebUI on dedicated infrastructure where you can fully tune the environment variables and RAG parameters to your specific needs.

Open WebUI RAG API Guide: Powering Search with AI