Simple RAG App Setup: Retrieve Text from PDFs with AI
Building an AI app sounds complicated—but it doesn’t have to be. In this tutorial, you’ll create a simple Retrieval-Augmented Generation (RAG) app that extracts text from PDFs and answers questions about them using an AI model like Claude. Even if you’re brand new to AI, this guide will walk you through every step.
By the end, you’ll have a basic app that:
- Accepts a PDF file
- Extracts its content
- Lets you ask questions about it
- Returns AI-powered answers
What You'll Need / Dependencies:
To follow along, install the following:
- Python (3.8+)
  Download: https://www.python.org/downloads/
- Virtual Environment (recommended)
  Create one in your project folder:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install required packages:
  In your terminal, run:
  pip install streamlit pypdf sentence-transformers faiss-cpu anthropic python-dotenv
- Get an API Key for Claude
  Sign up at Anthropic Console and create a new API key.
  Save it in a .env file or your system environment as ANTHROPIC_API_KEY.
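If the key is missing, the app only fails later, when the first request is sent. A minimal sketch of an early check follows, assuming the key is exposed as an environment variable (a .env file would hold a line of the form ANTHROPIC_API_KEY=your-key-here). The helper name `require_api_key` is hypothetical and not part of the anthropic package.

```python
import os


def require_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    """Return the API key found in `env`, or raise a clear error if it is missing.

    `require_api_key` is an illustrative helper, not an anthropic SDK function.
    """
    key = env.get(name, "").strip()  # treat whitespace-only values as missing
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return key
```

Calling this once at the top of app.py (after load_dotenv()) would surface a missing key immediately instead of at question time.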
Step-by-Step Instructions:
1. Create Your Project Folder
Make a new folder and open it in your code editor. Inside, create these files:
- app.py (main Streamlit app)
- utils.py (for PDF and embedding logic)
- .env (to store your Claude API key)
2. Write PDF Text Extraction Code
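In utils.py, add:
# pypdf is the package installed earlier (PyPDF2 is its older name)
from pypdf import PdfReader

def extract_text_from_pdf(file):
    # Accepts a file path or a file-like object (such as a Streamlit upload)
    reader = PdfReader(file)
    text = ""
    for page in reader.pages:
        # extract_text() can return nothing for image-only pages
        text += page.extract_text() or ""
    return text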
3. Add Embedding & Search
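Also in utils.py, add:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

# Small, fast embedding model; it is downloaded on first use
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_text_chunks(chunks):
    # Returns one vector per chunk as a 2D NumPy array
    return embedder.encode(chunks, convert_to_numpy=True)

def build_faiss_index(vectors):
    dim = vectors.shape[1]
    index = faiss.IndexFlatL2(dim)  # exact search using L2 distance
    index.add(vectors)
    return index

def get_top_chunks(query, chunks, index, chunk_vectors, k=3):
    # chunk_vectors is not used here; it is kept so the call in app.py stays the same
    k = min(k, len(chunks))  # never ask for more neighbors than there are chunks
    query_vec = embedder.encode([query], convert_to_numpy=True)
    distances, indices = index.search(np.asarray(query_vec, dtype="float32"), k)
    return [chunks[i] for i in indices[0]]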
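To see what the flat L2 index is doing conceptually, here is a plain-Python sketch of the same idea: compute the squared Euclidean distance from the query vector to every chunk vector and keep the k closest. It is an illustration only, not how FAISS is implemented, and the names `squared_l2` and `brute_force_top_k` are made up for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec, nearest first."""
    k = min(k, len(chunk_vecs))  # cannot return more neighbors than vectors
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: squared_l2(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]


# Toy 2D "embeddings" standing in for real chunk vectors
chunk_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(brute_force_top_k([0.9, 1.2], chunk_vecs, k=2))  # prints [1, 0]
```

The returned indices play the same role as indices[0] in get_top_chunks: they point back into the list of chunks.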
4. Create the Streamlit App
In app.py, paste:
import os

import anthropic
import streamlit as st
from dotenv import load_dotenv

from utils import extract_text_from_pdf, embed_text_chunks, build_faiss_index, get_top_chunks

# Read ANTHROPIC_API_KEY from .env (or the system environment)
load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

st.set_page_config(page_title="Ask My PDF", layout="wide")
st.title("📄 Ask Questions About Your PDF")

uploaded_file = st.file_uploader("Upload a PDF", type=["pdf"])
query = st.text_input("What do you want to know?")

if uploaded_file and query:
    with st.spinner("Processing..."):
        text = extract_text_from_pdf(uploaded_file)
        if not text.strip():
            # Scanned PDFs often contain images only, so there is nothing to embed
            st.error("No extractable text found in this PDF.")
            st.stop()

        # Split the text into fixed-size chunks of 500 characters
        chunks = [text[i:i+500] for i in range(0, len(text), 500)]
        vectors = embed_text_chunks(chunks)
        index = build_faiss_index(vectors)
        top_chunks = get_top_chunks(query, chunks, index, vectors)

        system_prompt = "Answer the user's question using only the provided context. If unsure, say so."
        # Separate chunks with blank lines so they do not run together
        context = "\n\n".join(top_chunks)
        user_prompt = f"Context:\n{context}\n\nQuestion: {query}"

        response = client.messages.create(
            model="claude-3-haiku-20240307",
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
            max_tokens=500
        )

    st.subheader("Answer:")
    st.write(response.content[0].text)
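The way the retrieved chunks are stitched into the prompt affects how easy they are for the model to tell apart. One possible variation, sketched below, labels each chunk and separates them with blank lines. The function name `build_user_prompt` and the "[Chunk n]" labels are illustrative choices, not something the Anthropic API requires.

```python
def build_user_prompt(query, top_chunks):
    """Format retrieved chunks and the question into a single user message.

    Each chunk gets a numbered label so it stays visually distinct in the prompt.
    """
    context = "\n\n".join(
        f"[Chunk {n}]\n{chunk.strip()}"
        for n, chunk in enumerate(top_chunks, start=1)
    )
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In app.py, the result of build_user_prompt(query, top_chunks) could be passed as the content of the user message.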
5. Run the App
Back in your terminal, run:
streamlit run app.py
Upload a PDF, type your question, and watch the magic happen!
Practical Examples / Code:
Here’s an example question you might ask a contract PDF:
“When does the agreement expire?”
The app retrieves the most relevant text chunks from the contract and passes them to Claude, which returns something like:
The agreement expires on June 30, 2026, unless terminated earlier by either party.
Best Practices & Tips:
- Chunk size matters: Too small and you lose context. Too large and a single embedding has to cover several topics at once, which blurs retrieval. Start with ~500 characters.
- Always check source context: Claude may hallucinate if the context chunks are too vague.
- Use smaller models for faster local prototyping (like all-MiniLM-L6-v2).
- Claude works best when you instruct it clearly in the system prompt.
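To experiment with the chunk-size tip, it helps to have a small chunking function whose size can be changed in one place. The sketch below also adds an overlap between neighboring chunks so a sentence cut at a boundary still appears whole in one of them. The overlap is an addition for illustration (the app above uses plain 500-character slices), and `chunk_text` with its default values is a hypothetical helper.

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of up to `size` characters.

    Consecutive chunks share `overlap` characters. The defaults are
    starting points to tune, not recommended values.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", size=4, overlap=1))  # prints ['abcd', 'defg', 'ghij']
```

Setting overlap=0 reproduces the fixed-size slicing used in app.py.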
Common issue?
Q: "Why do I get no results?"
A: Check that your PDF actually contains extractable text (some PDFs are just scanned images).
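A quick way to check this is to look at how much text each page actually produced. The sketch below assumes you already have one extracted string (or None) per page, for example from calling extract_text() on each page in reader.pages. The helper name `summarize_extraction` is hypothetical.

```python
def summarize_extraction(page_texts):
    """Return (total_chars, empty_pages) for a list of per-page extracted text.

    `page_texts` holds one string (or None) per page; page numbers in
    `empty_pages` are 1-based.
    """
    cleaned = [(t or "").strip() for t in page_texts]
    total_chars = sum(len(t) for t in cleaned)
    empty_pages = [n for n, t in enumerate(cleaned, start=1) if not t]
    return total_chars, empty_pages


# Example: pages 2 and 3 produced no usable text
print(summarize_extraction(["Hello", None, "  ", "x"]))  # prints (6, [2, 3])
```

If most pages show up as empty, the PDF is probably a scan and would need OCR before this app can answer questions about it.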
Conclusion & Recap:
You just built a basic RAG (Retrieval-Augmented Generation) app that:
- Extracts text from a PDF
- Embeds the content
- Retrieves the most relevant parts
- Uses Claude AI to answer your question
That’s a powerful setup for document Q&A—and you did it with just a few Python files!
What did you build today? Let us know!