    Building Your AI Q&A Bot for Webpages Using Open Source AI Models

    April 4, 2025

    In today’s information-rich digital landscape, navigating extensive web content can be overwhelming. Whether you’re researching for a project, studying complex material, or trying to extract specific information from lengthy articles, the process can be time-consuming and inefficient. This is where an AI-powered Question-Answering (Q&A) bot becomes invaluable.

    This tutorial walks through building a practical AI Q&A system that analyzes webpage content and answers specific questions about it. Instead of relying on paid API services, we’ll use open-source models from Hugging Face to create a solution that is:

    • Free to use
    • Runnable in Google Colab (no local setup required)
    • Customizable to your specific needs
    • Built on a pre-trained transformer model

    By the end of this tutorial, you’ll have a functional web Q&A system that can help you extract insights from online content more efficiently.

    What We’ll Build

    We’ll create a system that:

    1. Takes a URL as input
    2. Extracts and processes the webpage content
    3. Accepts natural language questions about the content
    4. Returns the passage from the webpage that best answers each question

    Prerequisites

    • A Google account to access Google Colab
    • Basic understanding of Python
    • No prior machine learning knowledge required

    Step 1: Setting Up the Environment

    First, open Google Colab and create a new notebook.

    Let’s start by installing the necessary libraries:

    # Install required packages
    !pip install transformers torch beautifulsoup4 requests

    This installs:

    • transformers: Hugging Face’s library for state-of-the-art NLP models
    • torch: PyTorch deep learning framework
    • beautifulsoup4: For parsing HTML and extracting web content
    • requests: For making HTTP requests to webpages

    Step 2: Import Libraries and Set Up Basic Functions

    Now let’s import all the necessary libraries and define some helper functions:

    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer
    import requests
    from bs4 import BeautifulSoup
    import re
    import textwrap

    # Check if GPU is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Function to extract text from a webpage
    def extract_text_from_url(url):
        try:
            # Send a browser-like User-Agent; some sites reject the default one
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove elements that carry no readable content
            for element in soup(['script', 'style', 'header', 'footer', 'nav']):
                element.decompose()

            text = soup.get_text()

            # Strip each line, split on double spaces, and drop empty fragments
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            # Collapse all remaining whitespace runs into single spaces
            text = re.sub(r'\s+', ' ', text).strip()

            return text

        except Exception as e:
            print(f"Error extracting text from URL: {e}")
            return None

    This code:

    1. Imports all necessary libraries
    2. Sets up our device (GPU if available, otherwise CPU)
    3. Creates a function to extract readable text content from a webpage URL
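
    To see what the clean-up stage of the extraction function does without making a network request, the sketch below applies the same two steps (dropping empty lines, then collapsing whitespace runs) to a hard-coded string. The helper name `normalize_whitespace` and the sample text are illustrative assumptions, not part of the tutorial’s code.

```python
import re


def normalize_whitespace(text):
    """Collapse the ragged whitespace that text extracted from HTML tends to contain."""
    # Strip each line and drop the ones that end up empty.
    lines = (line.strip() for line in text.splitlines())
    text = "\n".join(line for line in lines if line)
    # Collapse every remaining run of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample standing in for the output of soup.get_text()
sample = "  Title \n\n\n   First   paragraph.\t\n  Second paragraph.  \n"
print(normalize_whitespace(sample))
# prints: Title First paragraph. Second paragraph.
```

    Flattening everything to single spaces loses paragraph boundaries, which is acceptable here because the question-answering model only needs a continuous stream of text.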

    Step 3: Load the Question-Answering Model

    Now let’s load a pre-trained question-answering model from Hugging Face:

    # Load pre-trained model and tokenizer
    model_name = "deepset/roberta-base-squad2"
    
    
    print(f"Loading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)
    print("Model loaded successfully!")

    We’re using deepset/roberta-base-squad2, which is:

    • Based on RoBERTa architecture (a robustly optimized BERT approach)
    • Fine-tuned on SQuAD 2.0 (Stanford Question Answering Dataset)
    • A good balance between accuracy and speed for our task

    Step 4: Implement the Question-Answering Function

    Now, let’s implement the core functionality – the ability to answer questions based on the extracted webpage content:

    def answer_question(question, context, max_length=512):
        # Chunk size is measured in characters, a conservative stand-in for tokens:
        # a token usually covers several characters, and truncation=True below
        # trims any chunk that still exceeds max_length tokens.
        max_chunk_size = max_length - len(tokenizer.encode(question)) - 5
        all_answers = []

        for i in range(0, len(context), max_chunk_size):
            chunk = context[i:i + max_chunk_size]

            inputs = tokenizer(
                question,
                chunk,
                add_special_tokens=True,
                return_tensors="pt",
                max_length=max_length,
                truncation=True
            ).to(device)

            with torch.no_grad():
                outputs = model(**inputs)

            # Most likely start and end token positions of the answer span
            answer_start = torch.argmax(outputs.start_logits).item()
            answer_end = torch.argmax(outputs.end_logits).item()

            # Skip chunks where the predicted end precedes the start
            if answer_end < answer_start:
                continue

            # The sum of the two logits serves as a rough confidence score
            start_score = outputs.start_logits[0][answer_start].item()
            end_score = outputs.end_logits[0][answer_end].item()
            score = start_score + end_score

            # Decode the span; skip_special_tokens drops RoBERTa's <s> and </s>
            answer_ids = inputs.input_ids[0][answer_start:answer_end + 1]
            answer = tokenizer.decode(answer_ids, skip_special_tokens=True).strip()

            if answer and len(answer) > 2:
                all_answers.append((answer, score))

        if all_answers:
            all_answers.sort(key=lambda x: x[1], reverse=True)
            return all_answers[0][0]
        else:
            return "I couldn't find an answer in the provided content."

    This function:

    1. Takes a question and the webpage content as input
    2. Handles long content by processing it in chunks
    3. Uses the model to predict the answer span (start and end positions)
    4. Scores each chunk’s answer by the sum of its start and end logits and returns the highest-scoring one
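
    Taking the best start position and the best end position independently can yield an end that comes before the start. A common refinement is to search for the best-scoring pair that forms a valid span, and to compare it against the score at position 0, which models trained on SQuAD 2.0 use to signal that the passage contains no answer. The sketch below works on plain Python lists so it runs without a model; the function name `best_span` and the `max_answer_tokens` parameter are illustrative assumptions, and it does not exclude positions that belong to the question.

```python
def best_span(start_logits, end_logits, max_answer_tokens=30):
    """Return (start, end, score) for the best valid span, or None for "no answer".

    A span is valid when start <= end and it covers at most max_answer_tokens tokens.
    Position 0 is treated as the "no answer" position.
    """
    null_score = start_logits[0] + end_logits[0]
    best = None
    for start in range(1, len(start_logits)):
        last = min(start + max_answer_tokens, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    # Report "no answer" when the null position outscores every real span.
    if best is None or best[2] <= null_score:
        return None
    return best


# Made-up logits: the independent argmax picks start=2 and end=1, an invalid span.
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.1, 5.0, 0.2, 3.5, 0.4]
print(best_span(start_logits, end_logits))
# prints: (2, 3, 7.5)
```

    With a real model, the two lists could come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.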

    Step 5: Testing and Examples

    Let’s test our system with some examples. Here’s the test code:

    url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
    webpage_text = extract_text_from_url(url)

    # extract_text_from_url returns None when the request or parsing fails
    if webpage_text:
        print("Sample of extracted text:")
        print(webpage_text[:500] + "...")

        questions = [
            "When was the term artificial intelligence first used?",
            "What are the main goals of AI research?",
            "What ethical concerns are associated with AI?"
        ]

        for question in questions:
            print(f"\nQuestion: {question}")
            answer = answer_question(question, webpage_text)
            print(f"Answer: {answer}")
    else:
        print("No text could be extracted; check the URL and try again.")

    This will demonstrate how the system works with real examples.
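
    Step 2 imports `textwrap`, but none of the code shown uses it. One plausible use, sketched below as an assumption rather than the author’s intent, is wrapping long answers so they stay readable in notebook output. The helper name `format_qa` is hypothetical.

```python
import textwrap


def format_qa(question, answer, width=80):
    """Format a question/answer pair, wrapping the answer to the given width."""
    wrapped = textwrap.fill(
        answer,
        width=width,
        initial_indent="A: ",
        subsequent_indent="   ",
    )
    return f"Q: {question}\n{wrapped}"


print(format_qa("What is this page about?", "A short answer."))
```

    In the test loop, the two `print` calls could be replaced by `print(format_qa(question, answer))`.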

    Output of the above code

    Limitations and Future Improvements

    Our current implementation has some limitations:

    1. It can struggle with very long webpages due to context length limitations
    2. The model may not understand complex or ambiguous questions
    3. It works best with factual content rather than opinions or subjective material
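
    Part of the first limitation comes from how long text is divided: when it is cut into fixed-size pieces, an answer that straddles a boundary is split across two chunks and neither contains it whole. A common mitigation is to let consecutive chunks overlap. The sketch below is a minimal, character-based version; the function name `split_with_overlap` and its parameters are illustrative and not part of the tutorial’s code.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into chunks where consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

    The trade-off is extra compute: with an overlap of half the chunk size, the model processes roughly twice as many chunks.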

    Future improvements could include:

    • Implementing semantic search to better handle long documents
    • Adding document summarization capabilities
    • Supporting multiple languages
    • Implementing memory of previous questions and answers
    • Fine-tuning the model on specific domains (e.g., medical, legal, technical)
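
    Semantic search proper would embed the question and each chunk with a sentence-embedding model and compare the vectors. As a much cruder stand-in that needs no extra model, the sketch below ranks chunks by how many of the question’s words they contain, so that only the most promising chunks are passed to the question-answering model. This is lexical matching, not semantic search; the function name `rank_chunks`, the `top_k` parameter, and the sample chunks are assumptions for illustration.

```python
import re
from collections import Counter


def rank_chunks(question, chunks, top_k=3):
    """Return up to top_k chunks, ordered by how often they contain the question's words."""
    def words(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    question_words = set(words(question))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(words(chunk))
        score = sum(counts[word] for word in question_words)
        scored.append((score, index, chunk))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Chunks that share no words with the question are dropped.
    return [chunk for score, _, chunk in scored[:top_k] if score > 0]


chunks = [
    "The museum opens at nine.",
    "Tickets cost ten euros.",
    "Parking is free on weekends.",
]
print(rank_chunks("How much do tickets cost?", chunks, top_k=1))
# prints: ['Tickets cost ten euros.']
```

    This version does no stop-word filtering or stemming, so common words such as "the" can inflate scores on real pages.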

    Conclusion

    You have now built an AI-powered Q&A system for webpages using open-source models. This tool can help you:

    • Extract specific information from lengthy articles
    • Research more efficiently
    • Get quick answers from complex documents

    By utilizing Hugging Face’s powerful models and the flexibility of Google Colab, you’ve created a practical application that demonstrates the capabilities of modern NLP. Feel free to customize and extend this project to meet your specific needs.

    Useful Resources

    • Hugging Face Transformers Documentation
    • More about Question Answering Models
    • SQuAD Dataset Information
    • BeautifulSoup Documentation


    The post Building Your AI Q&A Bot for Webpages Using Open Source AI Models appeared first on MarkTechPost.
