Vladimir Chavkov

Building a RAG Application with Azure OpenAI and AI Search

Large language models generate fluent text, but they can only work with what they learned during training. Ask a model about your company’s internal documentation and it will either hallucinate an answer or admit it does not know. Retrieval-Augmented Generation (RAG) solves this by fetching relevant documents at query time and injecting them into the prompt as context.

The RAG Architecture

A RAG system has two phases:

Ingestion (offline): Documents are split into chunks, each chunk is converted to a vector embedding, and the chunks with their embeddings are stored in a search index.

Retrieval and generation (online): When a user asks a question, the question is embedded, similar chunks are retrieved from the index, and those chunks are included in the prompt sent to the LLM.

User Question
      |
      v
[Embed Query] --> [Vector Search in AI Search] --> [Top-K Chunks]
                                                        |
                                                        v
[Construct Prompt with Chunks] --> [Azure OpenAI] --> Answer

On Azure, this pattern uses Azure AI Search for indexing and retrieval and Azure OpenAI Service for embeddings and text generation.

Step 1: Document Ingestion and Chunking

Raw documents (PDFs, Word files, HTML pages) need to be split into manageable pieces. Chunking strategy matters because it determines what the model sees as context.

Common approaches include fixed-size chunks (a set number of tokens, usually with some overlap), sentence- or paragraph-based splitting, and structure-aware chunking that follows headings and sections.

For most use cases, 512-token chunks with 50-token overlap provide a good balance.
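The code below calls a `split_into_chunks` helper; a minimal fixed-size implementation might look like the following sketch, which approximates tokens with whitespace-separated words (a production version would count real tokens with a tokenizer such as tiktoken):

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    "Tokens" are approximated here by whitespace-separated words; swap in a
    real tokenizer for accurate token counts.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a sentence split across a boundary still appears whole in at least one chunk.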

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Extract text from PDF using Azure Document Intelligence
doc_client = DocumentIntelligenceClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com",
    credential=credential,
)

with open("handbook.pdf", "rb") as f:
    poller = doc_client.begin_analyze_document("prebuilt-read", body=f)
result = poller.result()
full_text = result.content  # full extracted text across all pages

Step 2: Generate Embeddings

Convert each text chunk into a vector using an embedding model. Azure OpenAI provides text-embedding-3-large (3072 dimensions) and text-embedding-3-small (1536 dimensions). The small model is cost-effective for most RAG applications.

from azure.identity import get_bearer_token_provider
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_version="2024-10-21",
    # Pass a token *provider* (a callable that refreshes Entra ID tokens),
    # not a one-off token string, which would expire mid-run.
    azure_ad_token_provider=get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    ),
)

def get_embedding(text: str) -> list[float]:
    # model= takes the deployment name; keeping it per-call lets the same
    # client serve both the embedding and chat deployments.
    response = client.embeddings.create(input=[text], model="text-embedding-3-small")
    return response.data[0].embedding

chunks = split_into_chunks(full_text, chunk_size=512, overlap=50)
embeddings = [get_embedding(chunk) for chunk in chunks]
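The loop above makes one API call per chunk. The embeddings endpoint also accepts a list of inputs, so a batched variant cuts request overhead; the sketch below takes the client explicitly, and the batch size is an illustrative choice, not a service requirement:

```python
def embed_in_batches(
    client,
    texts: list[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 16,
) -> list[list[float]]:
    """Embed texts in batches; the API returns one vector per input, in order."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(input=texts[i:i + batch_size], model=model)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

For large corpora you would also want retries with backoff around each call to ride out rate limits.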

Step 3: Create the Search Index

Create a search index with both text and vector fields. Azure AI Search supports hybrid search, combining keyword (BM25) and vector similarity in a single query.
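The vector half of such a query ranks chunks by embedding similarity. AI Search computes this inside the service (cosine is typically the metric used for these indexes), but the underlying measure is simple enough to show inline as a standalone illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 for identical direction,
    0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

This is only for intuition; in the pipeline itself, similarity scoring stays on the service side.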

from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
)

index_client = SearchIndexClient(
    endpoint="https://<your-search>.search.windows.net",
    credential=credential,
)

index = SearchIndex(
    name="documents",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,
            vector_search_profile_name="default-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default-algo")],
        profiles=[
            VectorSearchProfile(
                name="default-profile", algorithm_configuration_name="default-algo"
            )
        ],
    ),
)
index_client.create_or_update_index(index)

Upload the chunks with their embeddings:

from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",
    index_name="documents",
    credential=credential,
)

documents = [
    {"id": str(i), "content": chunk, "embedding": emb}
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
]
search_client.upload_documents(documents)
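upload_documents sends the whole list in one request, and the service caps a single indexing batch at 1,000 documents, so larger corpora need splitting. A minimal sketch:

```python
def upload_in_batches(search_client, documents: list[dict], batch_size: int = 1000) -> None:
    """Upload documents in batches to stay under the per-request document cap."""
    for i in range(0, len(documents), batch_size):
        search_client.upload_documents(documents[i:i + batch_size])
```

Checking the succeeded flag on each result returned by upload_documents is worthwhile in production, since individual documents within a batch can fail.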

Step 4: Retrieve and Generate

At query time, embed the user’s question, search for relevant chunks, and construct a prompt.

from azure.search.documents.models import VectorizedQuery

def ask(question: str) -> str:
    # Embed the question
    query_embedding = get_embedding(question)
    # Hybrid search: keyword + vector
    results = search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=5,
                fields="embedding",
            )
        ],
        top=5,
    )
    # Build context from top results
    context = "\n\n---\n\n".join([doc["content"] for doc in results])
    # Generate answer with context
    completion = client.chat.completions.create(
        model="gpt-4o",  # the deployment name of your chat model
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the provided context. "
                    "If the context does not contain enough information, say so. "
                    "Do not make up information."
                ),
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

Key Considerations

RAG is the most practical pattern for grounding LLMs in your own data. Azure AI Search handles the retrieval side with hybrid and semantic ranking, and Azure OpenAI provides the generation capability. Start with a small document set, measure answer quality against a known set of questions, and iterate on your chunking and prompt strategy.
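One lightweight way to start that measurement loop is a retrieval hit rate over a few hand-labeled question/chunk pairs. In the sketch below, the `retrieve` callable stands in for a search call like the one inside `ask`; all names are illustrative:

```python
def retrieval_hit_rate(retrieve, labeled: list[tuple[str, str]], k: int = 5) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results.

    `retrieve(question, k)` should return a ranked list of chunk ids.
    """
    hits = sum(
        1 for question, expected in labeled if expected in retrieve(question, k)[:k]
    )
    return hits / len(labeled)
```

Even a dozen labeled pairs will show whether a chunking or prompt change moved retrieval in the right direction before you invest in a fuller evaluation framework.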

