Extracting Structured Institution Data with LangChain

Posted by Anonymous and classified in Computers

Automated Data Extraction Using LangChain and Cohere

This script uses LangChain, a Cohere LLM, and the Wikipedia API (through the wikipediaapi package) to extract structured information about institutions. A Pydantic model gives the extracted data a fixed set of named fields, which makes the result easier to validate and reuse.

Required Libraries and Data Models

First, we import the necessary modules and define a Pydantic model that stores the institution details: founder, founding year, branches, employee count, and a short summary. Every field is a string because the LLM returns free text.

from langchain_community.llms import Cohere
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel
import wikipediaapi

class InstitutionDetails(BaseModel):
    founder: str
    founded: str
    branches: str
    employees: str
    summary: str
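
Because every field above is a string, the model accepts almost any text. As a rough sketch of what a stricter schema could look like, the variant below types the founding year as an integer. The class name TypedInstitutionDetails and its reduced field set are assumptions made for illustration and are not part of the original script.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class TypedInstitutionDetails(BaseModel):
    """Hypothetical stricter variant of InstitutionDetails."""
    founder: str
    founded: int                      # a year, coerced from strings like "1998"
    employees: Optional[int] = None   # optional, since the text may not say


# A numeric string is converted to an int by Pydantic's default validation.
ok = TypedInstitutionDetails(founder="Jane Doe", founded="1998")
print(ok.founded)

# Free text that is not a number is rejected with a ValidationError.
rejected = False
try:
    TypedInstitutionDetails(founder="Jane Doe", founded="late 1990s")
except ValidationError as exc:
    rejected = True
    print("Rejected field:", exc.errors()[0]["loc"])
```

The trade-off is that a stricter model fails more often on real LLM output, so the calling code needs a fallback for rejected values.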

Fetching Data from Wikipedia

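The following function connects to the Wikipedia API and returns the first max_chars characters of the institution's page. Despite its name, it reads page.text, which holds the full article; page.summary would return only the lead section. It also sets a descriptive user agent, as Wikipedia's API guidelines request.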

def fetch_wikipedia_summary(institution_name, max_chars=3000):
    wiki = wikipediaapi.Wikipedia(
        language='en',
        user_agent='InstitutionInfoBot/1.0 (https://www.wikipedia.org/)'
    )
    page = wiki.page(institution_name)
    if not page.exists():
        return "No information available."
    return page.text[:max_chars]
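
Slicing with [:max_chars] can cut the text in the middle of a sentence. A minimal sketch of one alternative, assuming a period followed by a space is a good enough sentence boundary, is shown below. The helper name truncate_at_sentence is hypothetical and does not appear in the original script.

```python
def truncate_at_sentence(text: str, max_chars: int = 3000) -> str:
    """Cut text to at most max_chars, preferring to end on a full sentence."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Find the last period that is followed by a space inside the clipped text.
    last_period = clipped.rfind(". ")
    if last_period == -1:
        # No sentence boundary found: fall back to the hard cut.
        return clipped
    return clipped[: last_period + 1]


sample = "Acme was founded in 1900. It has many branches. It employs thousands."
print(truncate_at_sentence(sample, max_chars=40))
```

In the function above, this helper could replace the final slice, for example return truncate_at_sentence(page.text, max_chars).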

Prompt Engineering for LLM Extraction

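We define a template string that tells the Cohere LLM which fields to extract and which output format to use. The {text} placeholder is filled with the Wikipedia text at run time, and the string is wrapped in a PromptTemplate in the main block.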

prompt_template = """ Extract the following information from the given text:
- Founder
- Founded (year)
- Current branches
- Number of employees
- 4-line brief summary 

Text: {text} 

Format:
Founder: 
Founded: 
Branches: 
Employees: 
Summary: """
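
To see what the model actually receives, it can help to fill the placeholder by hand. The sketch below uses Python's built-in str.format as a stand-in for PromptTemplate, which uses the same {name} placeholder syntax by default. The shortened template and the sample text are made up for illustration.

```python
# A shortened, hypothetical version of the template, for illustration only.
short_template = (
    "Extract the following information from the given text:\n"
    "- Founder\n\n"
    "Text: {text}\n\n"
    "Format:\n"
    "Founder: "
)

# str.format substitutes the {text} placeholder with the supplied value.
filled = short_template.format(text="Acme was founded by Jane Doe in 1900.")
print(filled)
```

One consequence of this syntax is that any literal curly braces in the template itself, such as a JSON example, need to be doubled ({{ and }}) so they are not read as placeholders.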

Main Execution and Response Parsing

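The main block handles user input, invokes the LLM chain, and parses the response into the InstitutionDetails model. Parsing splits each response line at its first colon, so it only works when the model follows the "Field: value" format requested in the prompt.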

if __name__ == "__main__":
    institution_name = input("Enter the name of the institution: ")
    wiki_text = fetch_wikipedia_summary(institution_name)

    # Set the COHERE_API_KEY environment variable; never hardcode the key
    llm = Cohere()
    prompt = PromptTemplate.from_template(prompt_template)

    # Chain prompt to LLM
    chain = prompt | llm

    # Get output
    response = chain.invoke({"text": wiki_text})

    try:
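        # Process the response: split each line at its first colon
        lines = response.strip().split('\n')
        info = {}
        for line in lines:
            key, sep, value = line.partition(':')
            if sep:
                # Normalize keys such as " Founder" or "- Founder" to "founder"
                info[key.strip(' -*').lower()] = value.strip()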

        # Parse into structured form
        details = InstitutionDetails(
            founder=info.get("founder", "N/A"),
            founded=info.get("founded", "N/A"),
            branches=info.get("branches", "N/A"),
            employees=info.get("employees", "N/A"),
            summary=info.get("summary", "N/A")
        )

        print("\nInstitution Details:")
        print(f"Founder: {details.founder}")
        print(f"Founded: {details.founded}")
        print(f"Branches: {details.branches}")
        print(f"Employees: {details.employees}")
        print(f"Summary: {details.summary}")
    except Exception as e:
        print("Error parsing response:", e)
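
The line-by-line parsing above drops any line without a colon, so a summary that spans several lines loses everything after its first line. The following sketch shows one way to handle that, under the assumption that the model keeps the field labels from the prompt. The function name parse_llm_response and the FIELDS tuple are hypothetical helpers, not part of the original script.

```python
FIELDS = ("founder", "founded", "branches", "employees", "summary")


def parse_llm_response(response: str) -> dict:
    """Parse 'Field: value' lines, joining continuation lines to the last field."""
    info = {}
    current = None  # the field that continuation lines belong to
    for raw_line in response.strip().splitlines():
        # Drop surrounding whitespace and leading bullet characters.
        line = raw_line.strip().lstrip("-* ").strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in FIELDS:
            info[key] = value.strip()
            current = key
        elif current is not None:
            # Not a known label: treat the line as a continuation.
            info[current] = (info[current] + " " + line).strip()
    # Missing or empty fields fall back to "N/A", as in the main block.
    return {field: info.get(field) or "N/A" for field in FIELDS}


sample_response = """Founder: Jane Doe
- Founded: 1998
Branches: 12
Employees:
Summary: A fictional institution.
It operates in several cities: north and south."""

parsed = parse_llm_response(sample_response)
for field, value in parsed.items():
    print(f"{field}: {value}")
```

The returned dictionary has exactly the keys of InstitutionDetails, so it could be passed to the model as InstitutionDetails(**parsed).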
