    A Coding Tutorial of Model Context Protocol Focusing on Semantic Chunking, Dynamic Token Management, and Context Relevance Scoring for Efficient LLM Interactions

    April 28, 2025

    Managing context effectively is a critical challenge when working with large language models, especially in environments like Google Colab, where resource constraints and long documents can quickly exceed available token windows. In this tutorial, we guide you through a practical implementation of the Model Context Protocol (MCP) by building a ModelContextManager that automatically chunks incoming text, generates semantic embeddings using Sentence-Transformers, and scores each chunk based on recency, importance, and relevance. You’ll learn how to integrate this manager with a Hugging Face sequence-to-sequence model, demonstrated here with FLAN-T5, to add, optimize, and retrieve only the most pertinent pieces of context. Along the way, we’ll cover token counting with a GPT-2 tokenizer, context-window optimization strategies, and interactive sessions that let you query and visualize your dynamic context in real time.

    import torch
    import numpy as np
    from typing import List, Dict, Any, Optional, Union, Tuple
    from dataclasses import dataclass
    import time
    import gc
    from tqdm.notebook import tqdm

    We import the essential libraries for building a dynamic context manager: torch and numpy handle tensor and numerical operations, while typing and dataclasses provide structured type annotations and data containers. Utility modules such as time and gc support timestamping and memory cleanup, while tqdm.notebook offers interactive progress bars for chunk processing in Colab.

    @dataclass
    class ContextChunk:
        """A chunk of text with metadata for the Model Context Protocol."""
        text: str
        embedding: Optional[torch.Tensor] = None
        importance: float = 1.0
        timestamp: float = 0.0
        metadata: Dict[str, Any] = None
       
        def __post_init__(self):
            if self.metadata is None:
                self.metadata = {}
            if self.timestamp == 0.0:
                self.timestamp = time.time()

    The ContextChunk dataclass encapsulates a single segment of text along with its embedding, a user-assigned importance score, a timestamp, and arbitrary metadata. Its __post_init__ method ensures that each chunk is stamped with the current time upon creation and that metadata defaults to an empty dictionary if none is provided.
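
    As a quick sanity check, here is a minimal sketch of how those defaults behave; it assumes the dataclass above and the imports from the first cell are already in your session:

    chunk = ContextChunk(text="MCP keeps only the most relevant context in the window.")

    print(chunk.metadata)                      # {} -- filled in by __post_init__
    print(time.time() - chunk.timestamp < 1)   # True -- the chunk was stamped at creation
    print(chunk.importance)                    # 1.0 default; callers can raise or lower this per chunk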

    class ModelContextManager:
        """
        Manager for implementing Model Context Protocol in LLMs on Google Colab.
        Handles context window optimization, token management, and relevance scoring.
        """
       
        def __init__(
            self,
            max_context_length: int = 8192,
            embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2",
            relevance_threshold: float = 0.7,
            recency_weight: float = 0.3,
            importance_weight: float = 0.3,
            semantic_weight: float = 0.4,
            device: str = "cuda" if torch.cuda.is_available() else "cpu"
        ):
            """
            Initialize the Model Context Manager.
           
            Args:
                max_context_length: Maximum number of tokens in context window
                embedding_model: Model to use for text embeddings
                relevance_threshold: Threshold for chunk relevance to be included
                recency_weight: Weight for recency in relevance calculation
                importance_weight: Weight for importance in relevance calculation
                semantic_weight: Weight for semantic similarity in relevance calculation
                device: Device to run computations on
            """
            self.max_context_length = max_context_length
            self.device = device
            self.chunks = []
            self.current_token_count = 0
            self.relevance_threshold = relevance_threshold
           
            self.recency_weight = recency_weight
            self.importance_weight = importance_weight
            self.semantic_weight = semantic_weight
           
            try:
                from sentence_transformers import SentenceTransformer
                print(f"Loading embedding model {embedding_model}...")
                self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
                print(f"Embedding model loaded successfully on {self.device}")
            except ImportError:
                print("Installing sentence-transformers...")
                import subprocess
                subprocess.run(["pip", "install", "sentence-transformers"])
                from sentence_transformers import SentenceTransformer
                self.embedding_model = SentenceTransformer(embedding_model).to(self.device)
                print(f"Embedding model loaded successfully on {self.device}")
               
            try:
                from transformers import GPT2Tokenizer
                self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
            except ImportError:
                print("Installing transformers...")
                import subprocess
                subprocess.run(["pip", "install", "transformers"])
                from transformers import GPT2Tokenizer
                self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
       
        def add_chunk(self, text: str, importance: float = 1.0, metadata: Dict[str, Any] = None) -> None:
            """
            Add a new chunk of text to the context manager.
           
            Args:
                text: The text content to add
                importance: Importance score (0-1)
                metadata: Additional metadata for the chunk
            """
            with torch.no_grad():
                embedding = self.embedding_model.encode(text, convert_to_tensor=True)
           
            chunk = ContextChunk(
                text=text,
                embedding=embedding,
                importance=importance,
                timestamp=time.time(),
                metadata=metadata or {}
            )
           
            self.chunks.append(chunk)
            self.current_token_count += len(self.tokenizer.encode(text))
           
            if self.current_token_count > self.max_context_length:
                self.optimize_context()
       
        def optimize_context(self) -> None:
            """Optimize context by removing less relevant chunks to fit within token limit."""
            if not self.chunks:
                return
               
            print("Optimizing context window...")
           
            scores = self.score_chunks()
           
            sorted_indices = np.argsort(scores)[::-1]
           
            new_chunks = []
            new_token_count = 0
           
            for idx in sorted_indices:
                chunk = self.chunks[idx]
                chunk_tokens = len(self.tokenizer.encode(chunk.text))
               
                if new_token_count + chunk_tokens <= self.max_context_length:
                    new_chunks.append(chunk)
                    new_token_count += chunk_tokens
                else:
                    if scores[idx] > self.relevance_threshold * 1.5:
                        for i, included_chunk in enumerate(new_chunks):
                            included_idx = sorted_indices[i]
                            if scores[included_idx] < self.relevance_threshold:
                                included_tokens = len(self.tokenizer.encode(included_chunk.text))
                                if new_token_count - included_tokens + chunk_tokens <= self.max_context_length:
                                    new_chunks.remove(included_chunk)
                                    new_token_count -= included_tokens
                                    new_chunks.append(chunk)
                                    new_token_count += chunk_tokens
                                    break
           
            removed_count = len(self.chunks) - len(new_chunks)
            self.chunks = new_chunks
            self.current_token_count = new_token_count
           
            print(f"Context optimized: Removed {removed_count} chunks, {len(new_chunks)} remaining, using {new_token_count}/{self.max_context_length} tokens")
           
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
       
        def score_chunks(self, query: str = None) -> np.ndarray:
            """
            Score chunks based on recency, importance, and semantic relevance.
           
            Args:
                query: Optional query to calculate semantic relevance against
               
            Returns:
                Array of scores for each chunk
            """
            if not self.chunks:
                return np.array([])
               
            current_time = time.time()
            max_age = max(current_time - chunk.timestamp for chunk in self.chunks) or 1.0
            recency_scores = np.array([
                1.0 - ((current_time - chunk.timestamp) / max_age)
                for chunk in self.chunks
            ])
           
            importance_scores = np.array([chunk.importance for chunk in self.chunks])
           
            if query is not None:
                query_embedding = self.embedding_model.encode(query, convert_to_tensor=True)
                similarity_scores = np.array([
                    torch.cosine_similarity(chunk.embedding, query_embedding, dim=0).item()
                    for chunk in self.chunks
                ])
               
                similarity_scores = (similarity_scores - similarity_scores.min()) / (similarity_scores.max() - similarity_scores.min() + 1e-8)
            else:
                similarity_scores = np.ones(len(self.chunks))
           
            final_scores = (
                self.recency_weight * recency_scores +
                self.importance_weight * importance_scores +
                self.semantic_weight * similarity_scores
            )
           
            return final_scores
       
        def retrieve_context(self, query: str = None, k: int = None) -> str:
            """
            Retrieve the most relevant context for a given query.
           
            Args:
                query: The query to retrieve context for
                k: The maximum number of chunks to return (None = all relevant chunks)
               
            Returns:
                String containing the combined relevant context
            """
            if not self.chunks:
                return ""
               
            scores = self.score_chunks(query)
           
            relevant_indices = np.where(scores >= self.relevance_threshold)[0]
           
            relevant_indices = relevant_indices[np.argsort(scores[relevant_indices])[::-1]]
           
            if k is not None:
                relevant_indices = relevant_indices[:k]
               
            relevant_texts = [self.chunks[i].text for i in relevant_indices]
            return "nn".join(relevant_texts)
       
        def get_stats(self) -> Dict[str, Any]:
            """Get statistics about the current context state."""
            return {
                "chunk_count": len(self.chunks),
                "token_count": self.current_token_count,
                "max_tokens": self.max_context_length,
                "usage_percentage": self.current_token_count / self.max_context_length * 100 if self.max_context_length else 0,
                "avg_chunk_size": self.current_token_count / len(self.chunks) if self.chunks else 0,
                "oldest_chunk_age": time.time() - min(chunk.timestamp for chunk in self.chunks) if self.chunks else 0,
            }
    
    
        def visualize_context(self):
            """Visualize the current context window distribution."""
            try:
                import matplotlib.pyplot as plt
                import pandas as pd
               
                if not self.chunks:
                    print("No chunks to visualize")
                    return
               
                scores = self.score_chunks()
                chunk_sizes = [len(self.tokenizer.encode(chunk.text)) for chunk in self.chunks]
                timestamps = [chunk.timestamp for chunk in self.chunks]
                relative_times = [time.time() - ts for ts in timestamps]
                importance = [chunk.importance for chunk in self.chunks]
               
                df = pd.DataFrame({
                    'Size (tokens)': chunk_sizes,
                    'Age (seconds)': relative_times,
                    'Importance': importance,
                    'Score': scores
                })
               
                fig, axs = plt.subplots(2, 2, figsize=(14, 10))
               
                axs[0, 0].bar(range(len(chunk_sizes)), chunk_sizes)
                axs[0, 0].set_title('Token Distribution by Chunk')
                axs[0, 0].set_ylabel('Tokens')
                axs[0, 0].set_xlabel('Chunk Index')
               
                axs[0, 1].scatter(chunk_sizes, scores)
                axs[0, 1].set_title('Score vs Chunk Size')
                axs[0, 1].set_xlabel('Tokens')
                axs[0, 1].set_ylabel('Score')
               
                axs[1, 0].scatter(relative_times, scores)
                axs[1, 0].set_title('Score vs Chunk Age')
                axs[1, 0].set_xlabel('Age (seconds)')
                axs[1, 0].set_ylabel('Score')
               
                axs[1, 1].scatter(importance, scores)
                axs[1, 1].set_title('Score vs Importance')
                axs[1, 1].set_xlabel('Importance')
                axs[1, 1].set_ylabel('Score')
               
                plt.tight_layout()
                plt.show()
               
            except ImportError:
                print("Please install matplotlib and pandas for visualization")
                print('!pip install matplotlib pandas')

    The ModelContextManager class orchestrates the end-to-end handling of context for LLMs by chunking input text, generating embeddings, and tracking token usage against a configurable limit. It implements relevance scoring (combining recency, importance, and semantic similarity), automatic context pruning, retrieval of the most pertinent chunks, and convenient utilities for monitoring and visualizing context statistics.
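
    Concretely, with the default weights each chunk scores 0.3 * recency + 0.3 * importance + 0.4 * normalized cosine similarity to the query. The following minimal usage sketch assumes the class above is defined and sentence-transformers is available; the 0.5 threshold and the sample strings are purely illustrative:

    manager = ModelContextManager(max_context_length=1024, relevance_threshold=0.5)

    manager.add_chunk("MCP scores chunks by recency, importance, and semantic similarity.", importance=0.9)
    manager.add_chunk("Unrelated trivia about weekend weather forecasts.", importance=0.2)

    # Query-aware retrieval: only chunks scoring above the threshold come back,
    # ordered from most to least relevant; k caps how many are joined together.
    context = manager.retrieve_context(query="How does MCP rank context chunks?", k=1)
    print(context)
    print(manager.get_stats())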

    class MCPColabDemo:
        """Demonstration of Model Context Protocol in Google Colab with a Language Model."""
       
        def __init__(
            self,
            model_name: str = "google/flan-t5-base",
            max_context_length: int = 2048,
            device: str = "cuda" if torch.cuda.is_available() else "cpu"
        ):
            """
            Initialize the MCP Colab demo with a specified model.
           
            Args:
                model_name: Hugging Face model name
                max_context_length: Maximum context length for the MCP manager
                device: Device to run the model on
            """
            self.device = device
            self.context_manager = ModelContextManager(
                max_context_length=max_context_length,
                device=device
            )
           
            try:
                from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
                print(f"Loading model {model_name}...")
                self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
                self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                print(f"Model loaded successfully on {device}")
            except ImportError:
                print("Installing transformers...")
                import subprocess
                subprocess.run(["pip", "install", "transformers"])
                from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
                self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
                self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                print(f"Model loaded successfully on {device}")
       
        def add_document(self, text: str, chunk_size: int = 512, overlap: int = 50) -> None:
            """
            Add a document to the context by chunking it appropriately.
           
            Args:
                text: Document text
                chunk_size: Size of each chunk in characters
                overlap: Overlap between chunks in characters
            """
            chunks = []
            for i in range(0, len(text), chunk_size - overlap):
                chunk = text[i:i + chunk_size]
                if len(chunk) > 20:  
                    chunks.append(chunk)
           
            print(f"Adding {len(chunks)} chunks to context...")
            for i, chunk in enumerate(tqdm(chunks)):
                pos = i / len(chunks)
                importance = 1.0 - 0.5 * min(pos, 1 - pos)
               
                self.context_manager.add_chunk(
                    text=chunk,
                    importance=importance,
                    metadata={"source": "document", "position": i, "total_chunks": len(chunks)}
                )
       
        def process_query(self, query: str, max_new_tokens: int = 256) -> str:
            """
            Process a query using the context manager and model.
           
            Args:
                query: The query to process
                max_new_tokens: Maximum number of tokens in response
               
            Returns:
                Model response
            """
            self.context_manager.add_chunk(query, importance=1.0, metadata={"type": "query"})
           
            relevant_context = self.context_manager.retrieve_context(query=query)
           
            prompt = f"Context: {relevant_context}nnQuestion: {query}nnAnswer:"
           
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
           
            print("Generating response...")
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,
                    temperature=0.7,
                    top_p=0.9,
                )
           
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
           
            self.context_manager.add_chunk(
                response,
                importance=0.9,
                metadata={"type": "response", "query": query}
            )
           
            return response
       
        def interactive_session(self):
            """Run an interactive session in the notebook."""
            from IPython.display import clear_output
           
            print("Starting interactive MCP session. Type 'exit' to end.")
            conversation_history = []
           
            while True:
                query = input("nYour query: ")
               
                if query.lower() == 'exit':
                    break
                   
                if query.lower() == 'stats':
                    print("nContext Statistics:")
                    stats = self.context_manager.get_stats()
                    for key, value in stats.items():
                        print(f"{key}: {value}")
                    self.context_manager.visualize_context()
                    continue
                   
                if query.lower() == 'clear':
                    self.context_manager.chunks = []
                    self.context_manager.current_token_count = 0
                    conversation_history = []
                    clear_output(wait=True)
                    print("Context cleared!")
                    continue
               
                response = self.process_query(query)
                conversation_history.append((query, response))
               
                print("nResponse:")
                print(response)
                print("n" + "-"*50)
               
                stats = self.context_manager.get_stats()
                print(f"Context usage: {stats['token_count']}/{stats['max_tokens']} tokens ({stats['usage_percentage']:.1f}%)")

    The MCPColabDemo class ties the context manager to a sequence-to-sequence LLM, loading FLAN-T5 (or any specified Hugging Face model) on the chosen device. It provides utility methods for chunking and ingesting entire documents, for processing user queries by prepending only the most relevant context, and for running an interactive Colab session complete with real-time stats, visualizations, and commands for clearing or inspecting the evolving context window.
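
    A hedged usage sketch, assuming the classes above are defined and you are running in a notebook (the file name report.txt and the question are placeholders):

    demo = MCPColabDemo(model_name="google/flan-t5-base", max_context_length=2048)

    # Ingest a long document: it is split into overlapping character chunks, with
    # slightly higher importance assigned near the beginning and the end.
    with open("report.txt") as f:
        demo.add_document(f.read(), chunk_size=512, overlap=50)

    # Answer a question using only the context that clears the relevance threshold.
    answer = demo.process_query("What are the key findings of the report?")
    print(answer)

    # Or drop into the notebook loop ('stats', 'clear', and 'exit' are special commands).
    # demo.interactive_session()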

    def run_mcp_demo():
        """Run a simple demo of the Model Context Protocol."""
        print("Running Model Context Protocol Demo...")
       
        context_manager = ModelContextManager(max_context_length=4096)
       
        print("Adding sample chunks...")
       
        context_manager.add_chunk(
            "The Model Context Protocol (MCP) is a framework for managing context "
            "windows in large language models. It helps optimize token usage and improve relevance.",
            importance=1.0
        )
       
        context_manager.add_chunk(
            "Context management involves techniques like sliding windows, chunking, "
            "and relevance filtering to handle large documents efficiently.",
            importance=0.8
        )
       
        for i in range(10):
            context_manager.add_chunk(
                f"This is test chunk {i} with some filler content to simulate a larger context "
                f"window that needs optimization. This helps demonstrate the MCP functionality "
                f"for context window management in language models on Google Colab.",
                importance=0.5 - (i * 0.02)  
            )
       
        stats = context_manager.get_stats()
        print("nInitial Statistics:")
        for key, value in stats.items():
            print(f"{key}: {value}")
           
        query = "How does the Model Context Protocol work?"
        print(f"nRetrieving context for: '{query}'")
        context = context_manager.retrieve_context(query)
        print(f"nRelevant context:n{context}")
       
        print("nVisualizing context:")
        context_manager.visualize_context()
       
        print("nDemo complete!")

    The run_mcp_demo function ties everything together in a single script: it instantiates the ModelContextManager, adds a series of sample chunks with varying importance, prints out initial statistics, retrieves and displays the most relevant context for a test query, and finally visualizes the context window, providing a complete, end-to-end demonstration of the Model Context Protocol in action.
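
    If you want to inspect the raw per-chunk scores rather than only the merged context string, a small sketch (assuming a populated ModelContextManager, here called context_manager as in the demo above) could look like this:

    scores = context_manager.score_chunks(query="How does the Model Context Protocol work?")
    for i, (chunk, score) in enumerate(zip(context_manager.chunks, scores)):
        kept = "kept" if score >= context_manager.relevance_threshold else "dropped"
        print(f"chunk {i:2d} | score {score:.3f} | {kept} | {chunk.text[:60]}...")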

    if __name__ == "__main__":
        run_mcp_demo()
    

    Finally, this standard Python entry-point guard ensures that the run_mcp_demo() function executes only when the script is run directly (rather than imported as a module), triggering the end-to-end demonstration of the Model Context Protocol workflow.

    In conclusion, you now have a fully functional MCP system that not only curbs runaway token usage but also prioritizes the context fragments that truly matter for your queries. The ModelContextManager equips you with tools to balance semantic relevance, temporal freshness, and user-assigned importance, while the accompanying MCPColabDemo class provides an accessible framework for real-time experimentation and visualization. Armed with these patterns, you can extend the core principles by adjusting relevance thresholds, experimenting with different embedding models, or integrating alternative LLM backends to tailor the system to your domain-specific workflows. Ultimately, this approach lets you build concise yet highly relevant prompts, resulting in more accurate and efficient responses from your language models.
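
    As a starting point for that kind of customization, here is a hedged sketch of a differently tuned manager; the embedding model, threshold, and weights below are illustrative choices rather than recommendations:

    custom_manager = ModelContextManager(
        max_context_length=4096,
        embedding_model="sentence-transformers/all-mpnet-base-v2",  # swap in a heavier embedder
        relevance_threshold=0.6,   # admit slightly more context than the 0.7 default
        recency_weight=0.2,
        importance_weight=0.2,
        semantic_weight=0.6,       # lean harder on query similarity
    )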


    Here is the Colab Notebook.
