Simple RAG App Setup: Retrieve Text from PDFs with AI

Building an AI app sounds complicated—but it doesn’t have to be. In this tutorial, you’ll create a simple Retrieval-Augmented Generation (RAG) app that extracts text from PDFs and answers questions about them using an AI model like Claude. Even if you’re brand new to AI, this guide will walk you through every step.

By the end, you’ll have a basic app that:

  • Accepts a PDF file
  • Extracts its content
  • Lets you ask questions about it
  • Returns AI-powered answers

What You'll Need / Dependencies:

To follow along, install the following:

  1. Python (3.8+)
    Download: https://www.python.org/downloads/
  2. Virtual Environment (recommended)
    Create one in your project folder:
    python -m venv venv  
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install required packages:
    In your terminal, run:
    pip install streamlit pypdf sentence-transformers faiss-cpu anthropic python-dotenv
  4. Get an API Key for Claude
    Sign up at Anthropic Console and create a new API key.
    Save it in a .env file or your system environment as ANTHROPIC_API_KEY.
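A minimal .env file in your project root would look like this (the key shown is a placeholder — paste your own key from the Anthropic Console):

```
ANTHROPIC_API_KEY=sk-ant-your-key-here
```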

Step-by-Step Instructions:

1. Create Your Project Folder

Make a new folder and open it in your code editor. Inside, create these files:

  • app.py (main Streamlit app)
  • utils.py (for PDF and embedding logic)
  • .env (to store your Claude API key)

2. Write PDF Text Extraction Code

In utils.py, add:

from pypdf import PdfReader

def extract_text_from_pdf(file):
    reader = PdfReader(file)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    return text

3. Add Embedding & Search

Also in utils.py, add:

import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_chunks(chunks):
    return embedder.encode(chunks, convert_to_numpy=True)

def build_faiss_index(vectors):
    dim = vectors.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(vectors)
    return index

def get_top_chunks(query, chunks, index, chunk_vectors, k=3):
    query_vec = embedder.encode([query])
    distances, indices = index.search(np.array(query_vec), k)
    return [chunks[i] for i in indices[0]]
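Under the hood, faiss.IndexFlatL2 does an exhaustive squared-Euclidean (L2) search over every stored vector and returns the indices of the closest ones. For intuition, here's a pure-Python sketch of that same lookup (l2_top_k is a hypothetical helper for illustration, not part of the app):

```python
def l2_top_k(query_vec, vectors, k=3):
    """Return the indices of the k vectors closest to query_vec
    by squared Euclidean distance (what IndexFlatL2 computes)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(vectors)), key=lambda i: sq_dist(query_vec, vectors[i]))
    return ranked[:k]

# Toy 2-D example: the query [0.9, 1.1] sits closest to [1, 1], then [0, 0].
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(l2_top_k([0.9, 1.1], vectors, k=2))  # [1, 0]
```

FAISS does the same thing with optimized C++ over float32 matrices, which is why it stays fast even with thousands of chunks.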

4. Create the Streamlit App

In app.py, paste:

import streamlit as st
from utils import extract_text_from_pdf, embed_text_chunks, build_faiss_index, get_top_chunks
from dotenv import load_dotenv
import os
import anthropic

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

st.set_page_config(page_title="Ask My PDF", layout="wide")
st.title("📄 Ask Questions About Your PDF")

uploaded_file = st.file_uploader("Upload a PDF", type=["pdf"])
query = st.text_input("What do you want to know?")

if uploaded_file and query:
    with st.spinner("Processing..."):
        text = extract_text_from_pdf(uploaded_file)
        chunks = [text[i:i+500] for i in range(0, len(text), 500)]
        vectors = embed_text_chunks(chunks)
        index = build_faiss_index(vectors)
        top_chunks = get_top_chunks(query, chunks, index, vectors)

        system_prompt = "Answer the user's question using only the provided context. If unsure, say so."
        context = "\n\n".join(top_chunks)
        user_prompt = f"Context:\n{context}\n\nQuestion: {query}"

        response = client.messages.create(
            model="claude-3-haiku-20240307",
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
            max_tokens=500
        )

        st.subheader("Answer:")
        st.write(response.content[0].text)

5. Run the App

Back in your terminal, run:

streamlit run app.py

Upload a PDF, type your question, and watch the magic happen!

Practical Examples / Code:

Here’s an example question you might ask a contract PDF:

“When does the agreement expire?”

Claude will search the text chunks of the contract and return something like:

The agreement expires on June 30, 2026, unless terminated earlier by either party.

Best Practices & Tips:

  • Chunk size matters: Too small and you lose context. Too large and you confuse the embedder. Start with ~500 characters.
  • Always check source context: Claude may hallucinate if the context chunks are too vague.
  • Use smaller models for faster local prototyping (like all-MiniLM-L6-v2).
  • Claude works best when you instruct it clearly in the system prompt.
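If you want to go beyond the naive fixed-size slicing used in app.py, an overlapping chunker helps preserve sentences that straddle a chunk boundary. Here's a sketch (chunk_text is a hypothetical helper, not wired into the app above):

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into fixed-size character chunks where each chunk
    repeats the last `overlap` characters of the previous one, so
    content near a boundary appears in both neighboring chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# 1200 characters with size=500, overlap=50 -> starts at 0, 450, 900
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print(len(chunks))                        # 3
print(chunks[0][-50:] == chunks[1][:50])  # True (the overlap region)
```

To use it, you'd replace the list comprehension in app.py's chunking step with a call to chunk_text(text).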

Common issue:
Q: "Why do I get no results?"
A: Check that your PDF actually contains extractable text (some PDFs are just scanned images).
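You could surface this problem early with a quick guard right after extraction (has_extractable_text is a hypothetical helper; the threshold is arbitrary):

```python
def has_extractable_text(text, min_chars=20):
    """Scanned (image-only) PDFs usually extract to empty or
    whitespace-only strings; treat very short output as 'no text'."""
    return len(text.strip()) >= min_chars

print(has_extractable_text("This Agreement expires on June 30, 2026."))  # True
print(has_extractable_text("   \n\t  "))                                 # False
```

In app.py, you might call this after extract_text_from_pdf and show st.warning("This PDF has no extractable text — it may be a scanned image.") when it returns False.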

Conclusion & Recap:

You just built a basic RAG (Retrieval-Augmented Generation) app that:

  • Extracts text from a PDF
  • Embeds the content
  • Retrieves the most relevant parts
  • Uses Claude AI to answer your question

That’s a powerful setup for document Q&A—and you did it with just a few Python files!

What did you build today? Let us know!
