In this blog we will build a simple News AI Assistant. We will use the GoogleNews Python library to fetch news headlines, ChromaDB as a vector database for semantic search, and finally a Retrieval Augmented Generation (RAG) pipeline to answer questions about the news.
A video walkthrough of this blog is available on my YouTube channel. We start by installing the GoogleNews Python library.
!pip install GoogleNews
Initialize the Google News object for fetching the news.
from GoogleNews import GoogleNews
googlenews = GoogleNews()
print(googlenews.getVersion())
Let's set the period to the last 2 days and the encoding to utf-8. Note that each constructor call creates a fresh object and discards earlier settings, so we pass both options at once.
googlenews = GoogleNews(period='2d', encode='utf-8')
Here we set the topic to "Sports". The long string is Google News's internal topic ID, which you can find in the URL of the Sports section.
googlenews.set_topic('CAAqKggKIiRDQkFTRlFvSUwyMHZNRFp1ZEdvU0JXVnVMVWRDR2dKSlRpZ0FQAQ')
googlenews.get_news()
And now we can fetch the titles of the news.
titles = googlenews.get_texts()
titles
Let's now install the ChromaDB library. We will use this as our vector database for doing semantic search.
!pip install chromadb
Initialize the ChromaDB client and create a collection.
import chromadb
# setup Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()
# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")
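Under the hood, ChromaDB turns each document into an embedding vector and ranks matches by vector similarity. A minimal sketch of the idea, using cosine similarity on toy 3-dimensional vectors (the vectors and labels here are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy "embeddings" for a query and two documents
query = [1.0, 0.0, 1.0]
docs = {
    "cricket news": [0.9, 0.1, 0.8],
    "stock market": [0.1, 0.9, 0.0],
}

# the most similar document wins the search
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # cricket news
```

ChromaDB does exactly this kind of ranking for us, just with learned embeddings and an efficient index instead of a brute-force loop.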
We need an id for each title, so we create a list of string ids.
# create ids for the documents; for now each id is just the title's index as a string
ids = [str(i) for i in range(len(titles))]
And metadata for the titles.
# one metadata dict per title; for now each maps the title's index (as a string) to the title itself
title_metadata = [{str(i): title} for i, title in enumerate(titles)]
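Putting the two steps together, here is what the ids and metadata look like for a couple of sample headlines (the `sample_titles` below are made up for illustration):

```python
sample_titles = ["India win the third Test", "Transfer window closes today"]

# Chroma requires each document id to be a string
sample_ids = [str(i) for i in range(len(sample_titles))]

# one metadata dict per title, keyed by its index
sample_metadata = [{str(i): title} for i, title in enumerate(sample_titles)]

print(sample_ids)       # ['0', '1']
print(sample_metadata)  # [{'0': 'India win the third Test'}, {'1': 'Transfer window closes today'}]
```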
We add the titles to the ChromaDB collection; Chroma computes the embedding vectors for us.
# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=titles,  # Chroma handles tokenization, embedding, and indexing automatically; you can also supply your own embeddings
    metadatas=title_metadata,  # filter on these!
    ids=ids,  # unique for each doc
)
Now we can query ChromaDB for the titles most similar to a question.
# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["What is happening with test matches?"],  # query text
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"},  # optional filter
    # where_document={"$contains": "search_string"}  # optional filter
)
And in the results we can see the titles most similar to our query.
results
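The query returns a dict of parallel lists with one entry per query text. Assuming the result shape documented by Chroma, pulling out the matched titles looks like this (the values in `sample_results` are illustrative, not real output):

```python
# illustrative result in the shape collection.query returns
sample_results = {
    "ids": [["3", "7"]],
    "documents": [["India win the third Test", "Rain delays day two of the Test"]],
    "distances": [[0.21, 0.34]],
}

# index 0 selects the answer for our single query text
top_titles = sample_results["documents"][0]
print(top_titles)  # ['India win the third Test', 'Rain delays day two of the Test']
```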
Let's dump the matched documents to a JSON string, which we will pass to the model as context.
import json
context = json.dumps(results["documents"])
context
Let's now set up the OpenAI API key.
import getpass
openai_key = getpass.getpass("Enter your OpenAI key: ")
import os
os.environ["OPENAI_API_KEY"] = openai_key  # `!export` runs in a subshell and would not persist
Install the LangChain OpenAI integration.
!pip install langchain-openai
Now we can set up the prompt and model for the RAG pipeline.
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
llm = OpenAI(api_key=openai_key)
prompt = PromptTemplate.from_template(
"""
You are a news assistant. The user asks you a question and you have a context.
Question: {question}
Context: {context}
Based on the question and the context, answer the question. Do not provide any information that is not present in the context.
Do not mention the context in the answer.
Stick as close to the question as possible.
"""
)
chain = prompt | llm
chain.invoke({"question": "What is happening with test matches?", "context": context })
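`prompt | llm` is LangChain's pipe syntax: the template is filled with the question and context, and the rendered string is sent to the model. The template-filling step itself is just string formatting, which can be sketched in plain Python (the JSON context value below is made up for illustration):

```python
template = (
    "You are a news assistant. The user asks you a question and you have a context.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Based on the question and the context, answer the question.\n"
)

rendered = template.format(
    question="What is happening with test matches?",
    context='[["India win the third Test"]]',  # JSON string from the retrieval step
)
print(rendered)
```

The model only ever sees this single rendered string, which is why grounding the answer in the retrieved context works: the headlines are right there in the prompt.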
We have successfully built a simple News AI Assistant using the GoogleNews Python library, ChromaDB, and a Retrieval Augmented Generation (RAG) pipeline. This can be further extended into a more complex AI assistant.
Due to a technical issue with Google News, we were not able to fetch the actual news articles, but we will update the code with them soon. Stay tuned for more such blogs.