Tokenization deep dive: Byte-level language modeling vs. traditional tokenization
- Learn how byte-level models process raw UTF-8 bytes directly, using a fixed vocabulary of 256 values
- Understand how this approach removes the need for subword tokenizers such as BPE or SentencePiece
- Compare byte-level models to tokenized models with larger vocabularies (e.g., 30k–50k tokens)
- Analyze the core trade-off: byte-level pipelines are simpler (no tokenizer training, no out-of-vocabulary handling), but input sequences become several times longer
- Evaluate how each approach handles multilingual text, where non-Latin scripts expand to multiple bytes per character under UTF-8
- Assess the impact on model size, since the embedding and output layers shrink from vocab_size × d_model down to 256 × d_model
- Examine differences in performance, as longer byte sequences raise per-document compute despite the smaller vocabulary (see the sketch after this list)
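As a concrete illustration of several of these points, here is a minimal sketch in plain Python (no external libraries). The model dimension of 768 and the 50,257-entry subword vocabulary are assumptions chosen to match a common GPT-2-style configuration, not values taken from this course; the sample strings are likewise arbitrary.

```python
# Minimal sketch of byte-level "tokenization": every string maps to a
# sequence of UTF-8 byte values, so the vocabulary is fixed at 256.

samples = {
    "English": "hello world",
    "French": "héllo le monde",
    "Chinese": "你好世界",
    "Emoji": "👋🌍",
}

for name, text in samples.items():
    byte_ids = list(text.encode("utf-8"))   # token IDs are just 0..255
    assert all(0 <= b < 256 for b in byte_ids)
    # Decoding is a lossless round trip: no <unk> tokens, no normalization.
    assert bytes(byte_ids).decode("utf-8") == text
    print(f"{name:8s} chars={len(text):3d}  byte tokens={len(byte_ids):3d}")

# Impact on model size (assumed d_model=768 and a 50,257-entry subword
# vocabulary, matching a common GPT-2-style setup):
d_model = 768
subword_vocab, byte_vocab = 50_257, 256
print(f"subword embedding params: {subword_vocab * d_model:,}")  # ~38.6M
print(f"byte embedding params:    {byte_vocab * d_model:,}")     # ~0.2M
```

Running this makes the multilingual asymmetry visible: English text is roughly one byte per character, while the Chinese sample expands to three bytes per character and each emoji to four. Byte-level models therefore pay a sequence-length (and thus compute) penalty on non-Latin text that a well-trained multilingual subword vocabulary partially absorbs, in exchange for a far smaller embedding table and a lossless, tokenizer-free input pipeline.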