Large language models generate fluent text, but they can only work with what they learned during training. Ask a model about your company’s internal documentation and it will either hallucinate an answer or admit it does not know. Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents at query time and injecting them into the prompt as context.
## The RAG Architecture
A RAG system has two phases:
1. **Ingestion (offline):** Documents are split into chunks, each chunk is converted to a vector embedding, and the chunks with their embeddings are stored in a search index.
2. **Retrieval and generation (online):** When a user asks a question, the question is embedded, similar chunks are retrieved from the index, and those chunks are included in the prompt sent to the LLM.
```
User Question
      |
      v
[Embed Query] --> [Vector Search in AI Search] --> [Top-K Chunks]
      |
      v
[Construct Prompt with Chunks] --> [Azure OpenAI] --> Answer
```

On Azure, this pattern uses Azure AI Search for indexing and retrieval and Azure OpenAI Service for embeddings and text generation.
## Step 1: Document Ingestion and Chunking
Raw documents (PDFs, Word files, HTML pages) need to be split into manageable pieces. Chunking strategy matters because it determines what the model sees as context.
Common approaches:
- Fixed-size chunks — Split every N tokens (e.g., 512 tokens) with overlap. Simple but can break mid-sentence.
- Semantic chunks — Split at paragraph or section boundaries. Preserves meaning but produces variable-size chunks.
- Sliding window — Fixed size with 10-20% overlap between consecutive chunks. Reduces information loss at boundaries.
For most use cases, 512-token chunks with 50-token overlap provide a good balance.
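The snippets below call a `split_into_chunks` helper that is not shown. A minimal sketch of the sliding-window approach, using whitespace-separated words as a rough stand-in for tokens (production code would count real tokens with a tokenizer such as tiktoken):

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Sliding-window chunking. Words approximate tokens here;
    swap in a real tokenizer for accurate token counts."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk shares its last `overlap` words with the start of the next chunk, so a sentence cut at a boundary still appears whole in at least one chunk.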
```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Extract text from a PDF using Azure Document Intelligence
doc_client = DocumentIntelligenceClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",
    credential=credential,
)

with open("handbook.pdf", "rb") as f:
    poller = doc_client.begin_analyze_document("prebuilt-read", body=f)
    result = poller.result()

# AnalyzeResult.content holds the full extracted text across all pages
full_text = result.content
```

## Step 2: Generate Embeddings
Convert each text chunk into a vector using an embedding model. Azure OpenAI provides text-embedding-3-large (3072 dimensions) and text-embedding-3-small (1536 dimensions). The small model is cost-effective for most RAG applications.
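Retrieval works because semantically similar texts map to nearby vectors, and "nearby" is usually measured with cosine similarity. A toy illustration in plain Python (the 3-dimensional vectors are made up for the example; real embeddings have 1536 or 3072 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"
query = [0.9, 0.1, 0.0]
chunk_about_vacations = [0.8, 0.2, 0.1]
chunk_about_invoices = [0.0, 0.3, 0.9]

# The vacation chunk points in nearly the same direction as the query
assert cosine_similarity(query, chunk_about_vacations) > cosine_similarity(query, chunk_about_invoices)
```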
```python
from openai import AzureOpenAI
from azure.identity import get_bearer_token_provider

# azure_ad_token_provider expects a callable that returns a fresh token,
# not a static token string; get_bearer_token_provider handles refresh
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

# No azure_deployment pinned on the client: the model= argument on each
# call maps to the deployment name, so one client can serve both the
# embedding and chat deployments
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_version="2024-10-21",
    azure_ad_token_provider=token_provider,
)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        input=[text], model="text-embedding-3-small"
    )
    return response.data[0].embedding

chunks = split_into_chunks(full_text, chunk_size=512, overlap=50)
embeddings = [get_embedding(chunk) for chunk in chunks]
```

## Step 3: Index in Azure AI Search
Create a search index with both text and vector fields. Azure AI Search supports hybrid search, combining keyword (BM25) and vector similarity in a single query.
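Azure AI Search merges the keyword and vector result lists with Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) over every list it appears in. A sketch of the idea (the document IDs are hypothetical; k = 60 is the conventional RRF constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # BM25 ranking
vector_hits = ["doc1", "doc5", "doc3"]    # vector-similarity ranking

# doc1 and doc3 appear in both lists, so they rise to the top
print(rrf_fuse([keyword_hits, vector_hits]))
```

This is why hybrid search is robust: a chunk only needs to rank well in one of the two lists to surface, and agreement between the lists is rewarded.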
```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)

index_client = SearchIndexClient(
    endpoint="https://<your-search>.search.windows.net",
    credential=credential,
)

index = SearchIndex(
    name="documents",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="default-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default-algo")],
        profiles=[
            VectorSearchProfile(
                name="default-profile",
                algorithm_configuration_name="default-algo",
            )
        ],
    ),
)

index_client.create_or_update_index(index)
```

Upload the chunks with their embeddings:
```python
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="documents",
    credential=credential,
)

documents = [
    {"id": str(i), "content": chunk, "embedding": emb}
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
]

search_client.upload_documents(documents)
```

## Step 4: Retrieve and Generate
At query time, embed the user’s question, search for relevant chunks, and construct a prompt.
```python
from azure.search.documents.models import VectorizedQuery

def ask(question: str) -> str:
    # Embed the question
    query_embedding = get_embedding(question)

    # Hybrid search: keyword + vector
    results = search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="embedding",
            )
        ],
        top=5,
    )

    # Build context from top results
    context = "\n\n---\n\n".join([doc["content"] for doc in results])

    # Generate the answer with the retrieved context
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the provided context. "
                    "If the context does not contain enough information, say so. "
                    "Do not make up information."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

## Key Considerations
- Chunk size affects quality. Too small and you lose context. Too large and you dilute relevance. Experiment with 256-1024 token ranges.
- Hybrid search outperforms vector-only. Combining BM25 keyword matching with vector similarity consistently produces better retrieval results.
- System prompts matter. Instruct the model to only use provided context and to acknowledge when it lacks information. This reduces hallucination.
- Evaluate end to end. Measure retrieval precision (are the right chunks returned?) and generation quality (is the answer correct and grounded?) separately. Azure AI Foundry provides built-in evaluation tools for this.
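Retrieval precision can be measured without any LLM in the loop, given a small labeled set of questions with known relevant chunk IDs. A minimal precision@k sketch (the retrieved and relevant IDs here are hypothetical):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# Hypothetical labeled example: which chunk IDs answer this question?
retrieved = ["12", "7", "45", "3", "99"]
relevant = {"7", "3", "8"}
print(precision_at_k(retrieved, relevant))  # 2 of the 5 are relevant -> 0.4
```

Tracking this number while varying chunk size or switching between vector-only and hybrid search tells you whether a quality problem lives in retrieval or in generation.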
RAG is the most practical pattern for grounding LLMs in your own data. Azure AI Search handles the retrieval complexity with hybrid and semantic ranking, and Azure OpenAI provides the generation capability. Start with a small document set, measure quality with evaluation, and iterate on your chunking and prompt strategy.