Large language models generate fluent text, but they can only work with what they learned during training. Ask a model about your company’s internal documentation and it will either hallucinate an answer or admit it does not know. Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents at query time and injecting them into the prompt as context.
## The RAG Architecture
A RAG system has two phases:
1. **Ingestion (offline):** Documents are split into chunks, each chunk is converted to a vector embedding, and the chunks with their embeddings are stored in a search index.
2. **Retrieval and generation (online):** When a user asks a question, the question is embedded, similar chunks are retrieved from the index, and those chunks are included in the prompt sent to the LLM.
```
User Question
      |
      v
[Embed Query] --> [Vector Search in AI Search] --> [Top-K Chunks]
      |
      v
[Construct Prompt with Chunks] --> [Azure OpenAI] --> Answer
```

On Azure, this pattern uses Azure AI Search for indexing and retrieval and Azure OpenAI Service for embeddings and text generation.
## Step 1: Document Ingestion and Chunking
Raw documents (PDFs, Word files, HTML pages) need to be split into manageable pieces. Chunking strategy matters because it determines what the model sees as context.
Common approaches:
- Fixed-size chunks — Split every N tokens (e.g., 512 tokens) with overlap. Simple but can break mid-sentence.
- Semantic chunks — Split at paragraph or section boundaries. Preserves meaning but produces variable-size chunks.
- Sliding window — Fixed size with 10-20% overlap between consecutive chunks. Reduces information loss at boundaries.
For most use cases, 512-token chunks with 50-token overlap provide a good balance.
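The snippets below call a `split_into_chunks` helper that is not shown. A minimal sketch of the sliding-window approach, using whitespace-separated words as a rough stand-in for tokens (production code would count real tokens with a tokenizer such as tiktoken):

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Sliding-window chunking. Words approximate tokens here;
    swap in a real tokenizer for accurate token counts."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk shares its last `overlap` words with the start of the next chunk, so a sentence cut at a boundary still appears whole in at least one chunk.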
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Extract text from a PDF using Azure Document Intelligence
doc_client = DocumentIntelligenceClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",
    credential=credential,
)

with open("handbook.pdf", "rb") as f:
    poller = doc_client.begin_analyze_document("prebuilt-read", body=f)
    result = poller.result()

# AnalyzeResult.content holds the full extracted text across all pages
full_text = result.content
```

## Step 2: Generate Embeddings
Convert each text chunk into a vector using an embedding model. Azure OpenAI provides text-embedding-3-large (3072 dimensions) and text-embedding-3-small (1536 dimensions). The small model is cost-effective for most RAG applications.
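Retrieval works because semantically similar texts map to nearby vectors, and "nearby" is usually measured with cosine similarity. A toy illustration in plain Python (the 3-dimensional vectors are made up for the example; real embeddings have 1536 or 3072 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"
query = [0.9, 0.1, 0.0]
chunk_about_vacations = [0.8, 0.2, 0.1]
chunk_about_invoices = [0.0, 0.3, 0.9]

# The vacation chunk points in nearly the same direction as the query
assert cosine_similarity(query, chunk_about_vacations) > cosine_similarity(query, chunk_about_invoices)
```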
```python
from openai import AzureOpenAI
from azure.identity import get_bearer_token_provider

# azure_ad_token_provider expects a callable that returns a fresh token,
# not a static token string; get_bearer_token_provider handles refresh
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# No azure_deployment pinned on the client: the model= argument on each
# call maps to the deployment name, so one client can serve both the
# embedding and chat deployments
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_version="2024-10-21",
    azure_ad_token_provider=token_provider,
)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=[text], model="text-embedding-3-small"
    )
    return response.data[0].embedding

chunks = split_into_chunks(full_text, chunk_size=512, overlap=50)
embeddings = [get_embedding(chunk) for chunk in chunks]
```

## Step 3: Index in Azure AI Search
Create a search index with both text and vector fields. Azure AI Search supports hybrid search, combining keyword (BM25) and vector similarity in a single query.
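Azure AI Search merges the keyword and vector result lists with Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) over every list it appears in. A sketch of the idea (the document IDs are hypothetical; k = 60 is the conventional RRF constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # BM25 ranking
vector_hits = ["doc1", "doc5", "doc3"]    # vector-similarity ranking

# doc1 and doc3 appear in both lists, so they rise to the top
print(rrf_fuse([keyword_hits, vector_hits]))
```

This is why hybrid search is robust: a chunk only needs to rank well in one of the two lists to surface, and agreement between the lists is rewarded.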
```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)

index_client = SearchIndexClient(
    endpoint="https://<your-search>.search.windows.net",
    credential=credential,
)

index = SearchIndex(
    name="documents",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="default-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default-algo")],
        profiles=[
            VectorSearchProfile(
                name="default-profile",
                algorithm_configuration_name="default-algo",
            )
        ],
    ),
)

index_client.create_or_update_index(index)
```

Upload the chunks with their embeddings:
```python
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="documents",
    credential=credential,
)

documents = [
    {"id": str(i), "content": chunk, "embedding": emb}
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
]

search_client.upload_documents(documents)
```

## Step 4: Retrieve and Generate
At query time, embed the user’s question, search for relevant chunks, and construct a prompt.
```python
from azure.search.documents.models import VectorizedQuery

def ask(question: str) -> str:
    # Embed the question
    query_embedding = get_embedding(question)

    # Hybrid search: keyword + vector
    results = search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="embedding",
            )
        ],
        top=5,
    )

    # Build context from top results
    context = "\n\n---\n\n".join([doc["content"] for doc in results])

    # Generate the answer with the retrieved context
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the provided context. "
                    "If the context does not contain enough information, say so. "
                    "Do not make up information."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

## Key Considerations
- Chunk size affects quality. Too small and you lose context. Too large and you dilute relevance. Experiment with 256-1024 token ranges.
- Hybrid search outperforms vector-only. Combining BM25 keyword matching with vector similarity consistently produces better retrieval results.
- System prompts matter. Instruct the model to only use provided context and to acknowledge when it lacks information. This reduces hallucination.
- Evaluate end to end. Measure retrieval precision (are the right chunks returned?) and generation quality (is the answer correct and grounded?) separately. Azure AI Foundry provides built-in evaluation tools for this.
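Retrieval precision can be measured without any LLM in the loop, given a small labeled set of questions with known relevant chunk IDs. A minimal precision@k sketch (the retrieved and relevant IDs here are hypothetical):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# Hypothetical labeled example: which chunk IDs answer this question?
retrieved = ["12", "7", "45", "3", "99"]
relevant = {"7", "3", "8"}
print(precision_at_k(retrieved, relevant))  # 2 of the 5 are relevant -> 0.4
```

Tracking this number while varying chunk size or switching between vector-only and hybrid search tells you whether a quality problem lives in retrieval or in generation.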
RAG is the most practical pattern for grounding LLMs in your own data. Azure AI Search handles the retrieval complexity with hybrid and semantic ranking, and Azure OpenAI provides the generation capability. Start with a small document set, measure quality with evaluation, and iterate on your chunking and prompt strategy.