Extracting Structured Institution Data with LangChain
Automated Data Extraction Using LangChain and Cohere
This script demonstrates how to use LangChain, Cohere, and the Wikipedia API to extract structured information about institutions. A Pydantic model gives the extracted data a fixed set of named fields, so downstream code can rely on every field being present.
Required Libraries and Data Models
First, we import the necessary modules and define our data structure using a Pydantic model to store institution details such as the founder, founding year, and employee count.
from langchain_community.llms import Cohere
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda
from pydantic import BaseModel
import wikipediaapi


class InstitutionDetails(BaseModel):
    founder: str
    founded: str
    branches: str
    employees: str
    summary: str

Fetching Data from Wikipedia
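The following function uses the Wikipedia API to retrieve the text of an institution's page, truncated to max_chars characters (3000 by default) so that the input passed to the model stays bounded. If the page does not exist, it returns the string "No information available." instead. It sets a user agent to comply with Wikipedia's best practices.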
def fetch_wikipedia_summary(institution_name, max_chars=3000):
    wiki = wikipediaapi.Wikipedia(
        language='en',
        user_agent='InstitutionInfoBot/1.0 (https://www.wikipedia.org/)'
    )
    page = wiki.page(institution_name)
    if not page.exists():
        return "No information available."
    return page.text[:max_chars]

Prompt Engineering for LLM Extraction
We define a prompt template that instructs the Cohere LLM to rewrite the raw text as labeled lines, one per field. The {text} placeholder is filled in with the Wikipedia text at run time.
prompt_template = """ Extract the following information from the given text:
- Founder
- Founded (year)
- Current branches
- Number of employees
- 4-line brief summary
Text: {text}
Format:
Founder:
Founded:
Branches:
Employees:
Summary: """

Main Execution and Response Parsing
The main block handles user input, invokes the LLM chain, and parses the response into the InstitutionDetails model.
import os

if __name__ == "__main__":
    institution_name = input("Enter the name of the institution: ")
    wiki_text = fetch_wikipedia_summary(institution_name)

    # Read the API key from the COHERE_API_KEY environment variable;
    # never hardcode a real key in source code
    llm = Cohere(cohere_api_key=os.environ["COHERE_API_KEY"])
    prompt = PromptTemplate.from_template(prompt_template)

    # Chain prompt to LLM
    chain = prompt | llm

    # Get output
    response = chain.invoke({"text": wiki_text})

    try:
        # Split each "Label: value" line at the first colon and strip the
        # label, so leading spaces in the model output do not break the lookup
        info = {}
        for line in response.strip().split('\n'):
            if ':' in line:
                key, _, value = line.partition(':')
                info[key.strip().lower()] = value.strip()

        # Parse into structured form, falling back to "N/A" for missing fields
        details = InstitutionDetails(
            founder=info.get("founder", "N/A"),
            founded=info.get("founded", "N/A"),
            branches=info.get("branches", "N/A"),
            employees=info.get("employees", "N/A"),
            summary=info.get("summary", "N/A")
        )

        print("\nInstitution Details:")
        print(f"Founder: {details.founder}")
        print(f"Founded: {details.founded}")
        print(f"Branches: {details.branches}")
        print(f"Employees: {details.employees}")
        print(f"Summary: {details.summary}")
    except Exception as e:
        print("Error parsing response:", e)
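The prompt asks for a four-line summary, but the line-by-line parsing above keeps only lines that contain a colon, so continuation lines of a multi-line summary can be lost. The following is a minimal sketch of one way to handle that, using only the standard library. The helper name parse_labeled_response and the FIELDS tuple are hypothetical, not part of the original script, and the sketch assumes the model sticks to the five labels listed in the prompt's "Format:" section.

```python
import re

# Labels the prompt asks the model to emit (assumed from the "Format:" section).
FIELDS = ("founder", "founded", "branches", "employees", "summary")

# Optional bullet or bold markers, a known label, a colon, then the value.
_LABEL_RE = re.compile(
    r"^\s*[-*•]?\s*\**\s*(" + "|".join(FIELDS) + r")\s*\**\s*:\s*\**\s*(.*)$",
    re.IGNORECASE,
)


def parse_labeled_response(response: str) -> dict:
    """Hypothetical helper: map 'Label: value' lines to a dict of the five fields."""
    info = {}
    current = None  # label whose value is currently being collected
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = _LABEL_RE.match(line)
        if match:
            # A new known label starts here.
            current = match.group(1).lower()
            info[current] = match.group(2).strip()
        elif current is not None:
            # No known label: treat the line as a continuation of the
            # previous field (for example, the rest of a multi-line summary).
            info[current] = (info[current] + " " + line).strip()
    # Missing fields fall back to "N/A", matching the script's behavior.
    return {field: info.get(field, "N/A") for field in FIELDS}
```

Because the returned dict has exactly the five field names of the model, it can be passed on as InstitutionDetails(**parse_labeled_response(response)). Lines with unknown labels are folded into the preceding field rather than creating new keys, which is a design choice of this sketch and may not suit every model output.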