From the series Words and Their Mappings.

the llm era: from embeddings to emergence

"All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available."

This quote is from the paper ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. (At least, now we know what Ilya saw in 2012.)

In the previous two posts of the series Words and Their Mappings, I briefly introduced LLMs and the journey of embeddings up to BERT and GPT. In this post, we will explore our recent past and the present.

Foundations

In the era following BERT and GPT, a categorically different kind of NLP emerged. I personally witnessed it during my years at Turkcell, working on a search engine project. New language models built on the transformer architecture were being published rapidly. In those days, BERT and GPT were themselves considered large language models.

Researchers were striving to address the limitations of these early transformer-based models while exploring new horizons in language understanding, reasoning, and representation. Despite these efforts, skepticism toward large language models was widespread. Many viewed the field as a potential dead end, constrained by computational costs and diminishing returns. Yet this skepticism was answered by what Richard Sutton famously termed The Bitter Lesson: progress driven by scaling models, data, and computational resources. (If you haven't already, I highly recommend reading Sutton's article. While I am skeptical about its details and extent, its core argument is compelling and valuable.)

This time, let us look into the foundational milestones that set the stage for the transformative capabilities of today's LLMs.

After BERT and GPT

As I mentioned before, when BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) were introduced, they categorically changed the field of NLP. BERT excelled in understanding tasks, leveraging bidirectional context to generate embeddings that reflected the full surrounding context of each word. GPT, on the other hand, focused on generative tasks, using autoregressive modeling to predict text sequences fluently.

Despite their important achievements, significant challenges remained:

Static Fine-Tuning: Both BERT and GPT required task-specific fine-tuning, a process that could be resource-intensive and limited the adaptability of the models across a diverse range of tasks.

Single-Task Optimization: BERT and GPT were optimized for narrow objectives (BERT for masked language modeling and next-sentence prediction, GPT for autoregressive language generation), resulting in limited versatility across tasks.

Data Inefficiency: Both models were trained with a pretraining-finetuning paradigm that lacked the flexibility to dynamically adapt to new data or tasks without retraining.

Contextual and Conceptual Gaps: While both models improved contextual understanding, they still struggled with tasks requiring deep reasoning, long-term dependencies, or representations of abstract concepts.

Scalability and Performance: Early transformer-based models hinted at the potential of larger architectures but were constrained by the computational resources available at the time.

These limitations prompted researchers to explore new architectures, training paradigms, and optimization strategies to expand the capabilities of transformers.

T5: A Unified Framework

What if we approached all NLP tasks as text-to-text tasks?

Introduced by Google Research in 2020, T5 (Text-to-Text Transfer Transformer) by Raffel et al. represented a significant leap forward. It proposed a unifying framework that treated all NLP tasks as text-to-text problems. In this paradigm, both input and output were expressed as strings, enabling a consistent and versatile approach to diverse tasks.

By framing tasks such as translation, summarization, classification, and even question answering as text-to-text transformations, T5 simplified the NLP pipeline. Instead of maintaining separate architectures, objectives, and output formats for each task, a single model with a single training objective could handle them all.

For the first time, tasks were specified as prompts: a short task prefix (such as "translate English to German:" or "summarize:") was prepended to the input text. Although these prompts were fixed, this can be considered one of the inspirations behind the idea of prompt engineering.
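
To make this concrete, here is a minimal sketch of T5's text-to-text interface using the Hugging Face transformers library. It assumes the publicly available "t5-small" checkpoint; the task prefixes shown are among the fixed prompts used in T5's multi-task training.

```python
# A minimal sketch of T5's text-to-text interface (assumes the "t5-small" checkpoint).
# Every task, whatever its nature, is expressed as input text -> output text.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: The house is wonderful.",          # translation
    "summarize: T5 treats every NLP problem as text generation ...",  # summarization
    "cola sentence: The book read the student.",                      # acceptability classification
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```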

Also, another important thing happened. Recall that we talked about contextual embeddings before. For example, "plane" as a vehicle and "plane" as a surface were represented differently in the contextual embeddings generated by BERT and GPT. With T5, the context captured in contextual embeddings also became task-specific: the embeddings started to represent the tasks themselves in the latent space.

In addition to all of that, T5 introduced a new training objective. Instead of BERT's single-token masked language modeling (MLM), contiguous spans of tokens were masked and replaced with sentinel tokens; this method is called span corruption. As a result, the model became more inclined to develop embeddings that encode multi-word phrases. As you can guess, many concepts (such as "artificial intelligence" or "large language models") are expressed as multi-word phrases in natural language, so this allowed better concept representations in the latent space.
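
Here is a toy illustration of the idea (a sketch, not T5's actual preprocessing code): contiguous spans are removed from the input and replaced with sentinel tokens, and the target asks the model to reconstruct exactly those spans.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: mask contiguous spans, replace each with a sentinel token.
    Returns an (input, target) pair in the spirit of T5's pretraining objective."""
    rng = random.Random(seed)
    corrupted, target = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if rng.random() < corruption_rate / mean_span_len:   # start a masked span here
            span_len = max(1, int(rng.gauss(mean_span_len, 1)))
            span = tokens[i:i + span_len]
            corrupted.append(f"<extra_id_{sentinel}>")        # sentinel marks the gap
            target.append(f"<extra_id_{sentinel}>")
            target.extend(span)                               # target reconstructs the span
            sentinel += 1
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    target.append(f"<extra_id_{sentinel}>")                   # closing sentinel
    return corrupted, target

tokens = "large language models represent concepts as multi word phrases".split()
inp, tgt = span_corrupt(tokens)
print("input: ", " ".join(inp))
print("target:", " ".join(tgt))
```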

Moreover, embedding multi-word phrases allowed denser representations, providing an additional performance boost and capturing broader conceptual information.

T5 also leveraged techniques like beam search, which systematically explored multiple output sequences to favor grammatical accuracy and fluency. This made T5 particularly effective in structured tasks like translation and summarization, where coherence and precision were the top priorities.

However, T5 had several problems:

"Prompts" were static: The model couldn't adapt to nuanced variations in task instructions. For instance, minor rewordings of a prompt might confuse the model or degrade performance.

Overfitting to Masking Objective: By focusing on reconstructing masked spans during training, the model occasionally struggled with tasks requiring longer-range dependencies or deeper reasoning, as it prioritized reconstructive accuracy over broader generalization.

Task-specific fine-tuning limitations: While T5 excelled in multi-task learning, it still required fine-tuning for many downstream tasks. Generalizing to entirely unseen tasks often necessitated additional training or adjustments.

Less general-purpose embeddings: Embeddings became heavily tied to the tasks they were trained on, potentially reducing generalizability to new or unseen tasks.

Fixed input lengths: Long documents or conversations needed to be truncated or split, which could lead to loss of context and degraded performance.

Loss of global understanding: While T5 was effective for localized tasks, maintaining coherence across longer texts or multi-turn dialogues remained challenging.

GPT-2: Beyond GPT

Released by OpenAI in 2019 and described in the paper Language Models are Unsupervised Multitask Learners, GPT-2 marked a significant leap forward in generative language modeling. It built upon the foundations of GPT (2018) while addressing several challenges of earlier transformer-based models like BERT.

As T5 framed all tasks as text-to-text, GPT-2 framed all tasks as next-word prediction. In principle, GPT-2 demonstrated the potential of language models to perform tasks without task-specific fine-tuning. It could handle a variety of NLP tasks such as translation, summarization and question-answering simply by interpreting task instructions in natural language.

To enhance generative diversity and flexibility, GPT-2 introduced advanced decoding techniques. Among these was temperature scaling, which adjusted the model’s confidence in selecting the next token. Lower temperatures focused on high-probability outputs, producing deterministic and coherent responses. Higher temperatures, by contrast, allowed the model to explore less probable tokens, fostering creativity at the risk of incoherence.

Alongside temperature scaling, GPT-2 employed top-k sampling and nucleus sampling (top-p) to further refine text generation, enabling creative, contextually appropriate outputs while mitigating repetitiveness.
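
To build intuition for how these knobs interact, here is a toy decoding step over a five-token vocabulary (a NumPy sketch of temperature, top-k, and nucleus filtering, not GPT-2's actual implementation):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Toy decoding step: temperature scaling, then optional top-k / nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # keep only the k most probable tokens
        kth_largest = np.sort(probs)[-top_k]
        probs = np.where(probs >= kth_largest, probs, 0.0)
        probs /= probs.sum()
    if top_p is not None:
        # nucleus: keep the smallest prefix of tokens whose cumulative probability reaches top_p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep = set(order[:cutoff].tolist())
        probs = np.array([p if i in keep else 0.0 for i, p in enumerate(probs)])

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.5, 0.3, -1.0, -2.0]   # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower the temperature and the distribution sharpens toward the top token; raise it, or loosen top-k/top-p, and the tail tokens get a real chance of being sampled.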

BERT and GPT had task-agnostic embeddings; T5 had task-specific embeddings; GPT-2 had task-adaptive embeddings.

The main difference between GPT and GPT-2 was scale: more parameters and more training data. While GPT had 117 million parameters and was trained on the roughly 7 GB BookCorpus dataset, GPT-2 had 1.5 billion parameters and was trained on WebText, a much larger dataset of roughly 40 GB of web text.

One can say "so what, better next-word prediction?", but scale is more than just scale.

Emergence, Scaling Laws and Zero-Shot Learning

The jump from GPT to GPT-2 was more than a quantitative increase in scale; it was a qualitative shift in what language models could achieve. As the size of the model and its training data increased, something remarkable happened: emergent capabilities began to surface.

Before continuing, let me talk about the concept of emergence.

Emergence is a fundamental idea in complexity science, a field (one of my favourite fields!) that explores how simple rules at a lower level can give rise to intricate and unexpected phenomena at a higher level. In essence, emergence refers to the appearance of behaviors, properties, or patterns in a system that cannot be directly inferred from its individual components.

For example, think of a flock of birds. Each bird follows simple local rules: maintain distance from neighbors, align direction, and avoid collisions. Yet, as a collective, the flock exhibits graceful, dynamic patterns that seem to have a mind of their own. The behavior of the flock as a whole emerges from the interactions between the birds and their environment.

Similarly, in language models, emergence was seen in their ability to perform tasks they had not been explicitly trained for.

For instance:

Zero-Shot Learning: GPT-2 could generalize to unseen tasks simply by interpreting natural language prompts. This was not the result of task-specific fine-tuning but rather an inherent property of the model’s training on diverse, large-scale datasets.

Coherent Text Generation: Unlike earlier models that produced fragmented or nonsensical output, GPT-2 could generate coherent, contextually appropriate text over long passages.

Semantic Understanding: While still imperfect, GPT-2 demonstrated an ability to capture deeper relationships between words and concepts, allowing it to infer meanings and generate plausible answers to open-ended questions.

What makes these capabilities emergent is that they were not directly programmed into the model or obvious outcomes of its design. Instead, they arose as by-products of scaling up the model and training it on diverse, interconnected datasets.

Scaling Laws

The concept of emergence led researchers to ask: Why does scale lead to these transformative behaviors? And, more importantly, how much further could scale take us?

In 2020, OpenAI published the seminal paper Scaling Laws for Neural Language Models by Kaplan et al., which systematically studied how the performance of language models improved as three key factors were scaled:

Model Size: The number of parameters in the network.

Dataset Size: The volume of training data.

Compute Power: The computational resources used for training.

[Figure: scaling laws]

The findings were groundbreaking:

Performance Follows Power Laws: The improvements in model performance were found to follow predictable power-law trends as model size, dataset size, and compute power increased. These trends provided a roadmap for building better models: scale up, and performance will improve. (A sketch of the functional form follows this list.)

No Plateau in Sight: The research indicated that larger models trained on larger datasets continued to deliver gains, defying traditional ML concerns about overfitting. Instead, larger models proved to be more sample-efficient, reaching the same performance with fewer training examples.

Diminishing Returns but Justifiable Costs: While returns diminished at extreme scales, the gains remained meaningful, justifying investments in ever-larger architectures.
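
In rough form (a sketch, with approximate exponents as reported by Kaplan et al.), the loss follows:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
$$

where L is the cross-entropy loss, N the number of (non-embedding) parameters, D the dataset size in tokens, and C_min the compute budget, with fitted exponents of roughly 0.076, 0.095, and 0.050 respectively. The small exponents mean improvement is slow but remarkably steady: each multiplicative increase in scale buys a predictable drop in loss.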

The Scaling Laws paper established a theoretical framework for the rapid advancements that followed. It showed that scaling wasn’t just a brute-force approach—it was a principle-driven strategy.

The implications of these findings were extremely important. They validated the notion that as models grow larger and datasets more diverse, new capabilities naturally emerge.

For instance:

GPT-2’s zero-shot learning capabilities became more pronounced as its scale increased.

Larger models developed richer latent spaces, where representations of words and concepts encoded nuanced, multidimensional relationships.

The model’s ability to generate coherent, contextually appropriate text over long passages improved as it scaled.

So, how far would these capabilities go?

GPT-3

Released in 2020, GPT-3 (Generative Pre-trained Transformer 3) was OpenAI’s answer to the lessons of scaling and emergence. With 175 billion parameters, and larger training data, GPT-3 far exceeded its predecessor GPT-2 (1.5 billion parameters) and started a new era in natural language processing.

Scaling GPT-3 revealed stronger emergent properties:

Few-Shot and Zero-Shot Learning: GPT-3 could adapt to tasks with minimal instruction. By providing a few examples (few-shot) or even just a well-crafted natural language prompt (zero-shot), GPT-3 demonstrated surprising proficiency in tasks like translation, summarization, and creative writing. This flexibility minimized the need for task-specific fine-tuning.

Reasoning and Problem Solving: Tasks that traditionally required specialized models, including coding and mathematical reasoning, became accessible to GPT-3 through well-designed prompts. This ability to generalize across domains was a testament to the power of its scaled latent space.

Longer Contextual Understanding: With an increased capacity to handle larger contexts, GPT-3 excelled at tasks requiring reasoning over extended passages. It could write coherent multi-paragraph essays, hold extended conversations, and even perform complex reasoning tasks.

Better Conceptual Representations: GPT-3’s embeddings became richer and more nuanced, allowing it to represent abstract concepts, multi-word phrases, and task-specific relationships in a multidimensional latent space. This was evident in its ability to generate plausible responses to open-ended philosophical questions, technical problem-solving prompts, and creative storylines.

Prompt Engineering

Prompt engineering is more than just crafting clever sentences. It is a semi-scientific, or even scientific, approach to manipulating the probabilistic underpinnings of large language models (LLMs) to achieve desired outcomes. It is more than just "please explain step-by-step".

At its core, prompt engineering harnesses the statistical and structural representations encoded in embeddings and latent space, enabling LLMs to respond with remarkable precision across a diverse range of tasks.

To understand why prompt engineering is effective, we must look at the mechanisms underlying modern LLMs. These models operate on three core principles:

During training, LLMs process vast amounts of text to capture statistical relationships between tokens (words, subwords, or characters). This allows them to predict the next token in a sequence with high accuracy, forming the basis of their generative capabilities.

Every token, phrase, or concept is represented as a dense vector in a high-dimensional latent space. This latent space is structured such that semantically similar words (e.g., "king" and "queen") are positioned close to each other, while dissimilar words (e.g., "king" and "fish") are farther apart. These embeddings encode both syntactic and semantic information. Let's remember: tasks themselves are also encoded into the latent space.

Also, the transformer architecture dynamically adjusts token embeddings based on their surrounding context. For example, the word "plane" as a surface and "plane" as a vehicle will occupy different regions of latent space due to their distinct contexts.
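
As a rough illustration (a sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint, and assuming "plane" is a single WordPiece token, which it is in that vocabulary), we can compare the contextual embeddings of the same surface form in different sentences:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word="plane"):
    """Return the contextual embedding of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

vehicle = embedding_of("The plane landed safely at the airport.")
surface = embedding_of("The points lie on a plane in three dimensional space.")
another_vehicle = embedding_of("The plane took off despite the storm.")

cos = torch.nn.functional.cosine_similarity
print("vehicle vs surface:", cos(vehicle, surface, dim=0).item())          # lower similarity
print("vehicle vs vehicle:", cos(vehicle, another_vehicle, dim=0).item())  # higher similarity
```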

When you design a prompt, you effectively guide the traversal of the model through latent space. By framing a question or task in a specific way, you nudge the model toward regions of latent space where the embeddings are likely to yield the desired response.

At the heart of prompt engineering lie conditional probabilities: the model predicts the next token given the sequence of previous tokens. A well-engineered prompt maximizes the probability that the desired sequence will align with the model's learned patterns.
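
In notation, the probability of a completion factorizes token by token, with the prompt conditioning every factor:

$$
P(y_1, \dots, y_T \mid \text{prompt}) = \prod_{t=1}^{T} P(y_t \mid \text{prompt}, y_1, \dots, y_{t-1})
$$

Changing the prompt changes every term in this product, which is why even small rewordings can shift the output so dramatically.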

For example, the query “Translate the following sentence into French: Words, my colonel, do not reach certain meaning.” increases the probability of generating “Les mots, mon colonel, n'atteignent pas certains sens." because the phrase “Translate the following sentence into French” strongly activates regions of latent space associated with translation tasks.

Similarly, provided the data and model are sufficiently large and diverse, the same applies to mathematical or coding tasks.

When designing prompts, words act as notations, symbols that guide the model's traversal through latent space, activating regions where learned relationships and patterns reside. LLMs do not "learn words"; they learn the world encoded in the relationships between words and contexts. The text they process is a proxy for the underlying structure of reality, enabling models to approximate the patterns, concepts, and associations within the data.

Now, let's look at some prompt engineering techniques and gain intuition about why and how they work.

Some Popular Prompt Engineering Techniques and Their Probabilistic Behaviour

The evolution of prompt engineering has seen the development of systematic approaches that leverage the model’s embeddings and latent space more effectively. Here are some key techniques:

Zero-Shot Prompting

Zero-shot prompting involves directly instructing the model to perform a task using natural language without providing examples. This relies on the model's pre-trained knowledge and its ability to generalize from its latent space representation.

Example:

> Write a function that obtains n random samples from a Qdrant vector database

In zero-shot prompting, the prompt directly narrows the vast latent space to regions associated with the task, relying on patterns encoded during pretraining. The model aligns its probabilities with general-purpose task knowledge.

The prompt serves as an instruction without examples. Words in the prompt (e.g., “write code” or “summarize”) act as high-level markers that activate regions of latent space where similar tasks are encoded. Probabilities are adjusted based on data patterns, increasing the likelihood of generating a correct result.

Few-Shot/N-Shot Prompting

Few-shot prompting provides a small number of input-output examples to prime the model. This reduces ambiguity and improves task performance by anchoring the model in the desired latent region.

Example:

Extract the company name, date, and total revenue from the following financial summaries:
Summary: The quarterly earnings report for XYZ Corp., dated August 15, 2023, revealed a total revenue of $45 million for Q2.
Company Name: XYZ Corp.
Date: August 15, 2023
Total Revenue: $45 million

Summary: On September 10, 2022, Alpha Industries announced its yearly revenue of $3.5 billion in its annual financial report.
Company Name: Alpha Industries
Date: September 10, 2022
Total Revenue: $3.5 billion

Summary: BlueTech Inc. published its financial statement for Q1 on May 3, 2023, showing a revenue of $120 million.

Model Output:

Company Name: BlueTech Inc.
Date: May 3, 2023
Total Revenue: $120 million

Now, suppose the task "Extract the company name, date, and total revenue" is less likely to be encoded in the latent space of the LLM. However, by providing a template or a pattern, we guide the model toward creating a localized pathway in the latent space that aligns with the task. This template acts as a guideline, allowing the model to infer relationships between the input text and the desired output structure.

By repeating the pattern across examples, the model builds a probabilistic structure where the likelihood of following the specified format increases significantly. Even if the task is not pre-encoded, the prompt conditions the model to behave as though it understands the task intrinsically.

Furthermore, the latent space traversal is adjusted dynamically:

The initial input text activates embeddings related to finance and temporal information.

The template acts as a guidepost, filtering out irrelevant embeddings and concentrating the model's focus on specific aspects of the input.

As the output tokens are predicted, the model continually refines its context, staying aligned with the established template.

This demonstrates the flexibility of few-shot prompting to "teach" the model tasks that are not explicitly encoded, leveraging its generalization ability and probabilistic structure to perform domain-specific operations.

Chain of Thought Prompting

Chain-of-thought (CoT) prompting encourages a language model to generate intermediate reasoning steps before arriving at a final answer. This technique leverages the model's ability to simulate structured thought processes, improving performance on tasks requiring logical reasoning, multi-step computation, or contextual dependencies.

Example:

Task: Explain the cause and process of photosynthesis in plants.
Explain photosynthesis by breaking it into steps:
1. Identify the primary purpose of photosynthesis.
2. Describe the inputs required for photosynthesis.
3. Explain the process that occurs during photosynthesis.
4. State the outputs of photosynthesis.

Model Output:

1. The primary purpose of photosynthesis is to convert light energy into chemical energy for the plant's use.
2. The inputs required are sunlight, carbon dioxide, and water.
3. During photosynthesis, chlorophyll in the plant's cells absorbs sunlight, and the plant uses this energy to convert carbon dioxide and water into glucose and oxygen.
4. The outputs are glucose, which serves as energy for the plant, and oxygen, which is released into the atmosphere.
Answer: Photosynthesis is the process by which plants convert light energy into chemical energy, using sunlight, carbon dioxide, and water to produce glucose and oxygen.

CoT prompting modifies the conditional probability distribution by embedding reasoning steps within the sequence. Each step acts as contextual input for the subsequent prediction. The model is guided through a structured traversal of latent space, ensuring that intermediate outputs logically align with the final answer.

The phrase "break it into steps" signals the model to activate embeddings associated with reasoning and problem-solving rather than directly generating an answer. This shifts the latent space traversal toward regions representing logical relationships and intermediate computations.

With each step, the model converges toward the result, conditioning every new prediction on the reasoning it has already generated.

Few-Shot Chain-of-Thought Prompting

Few-shot CoT combines the benefits of few-shot examples and CoT reasoning. By including worked-out examples with intermediate reasoning steps, this technique provides a robust template for complex tasks.

Example:

Example 1:
Q: If a train travels 60 miles per hour for 2 hours, how far did it travel?
A:
Step 1: Identify the speed: 60 miles per hour.
Step 2: Identify the time: 2 hours.
Step 3: Multiply speed by time: 60 × 2 = 120 miles.
Answer: 120 miles.

Example 2:
Q: A cyclist rides 20 miles per hour for 4 hours. How far did they ride?
A:
Step 1: Identify the speed: 20 miles per hour.
Step 2: Identify the time: 4 hours.
Step 3: Multiply speed by time: 20 × 4 = 80 miles.
Answer: 80 miles.

Q: A car drives 50 miles per hour for 3 hours. How far did it travel?
A:

Model Output:

Step 1: Identify the speed: 50 miles per hour.
Step 2: Identify the time: 3 hours.
Step 3: Multiply speed by time: 50 × 3 = 150 miles.
Answer: 150 miles.

Reframing Prompts

Reframing involves rewording or adding context to a prompt to make the task more aligned with patterns the model recognizes. This technique works well when initial outputs are suboptimal due to ambiguity or the model misinterpreting the task.

Example:

Instead of:

Summarize this article in one sentence.

Reframe as:

Imagine you are summarizing this article for a 5th grader. Write one simple sentence summarizing the article.

Reframing activates latent space embeddings related to specific contexts (e.g., simplifying for a younger audience). Probabilistically, the additional instruction narrows the output distribution to align with the desired tone or complexity.

Prompt Cascading

Prompt cascading involves chaining multiple prompts together to handle complex, multi-stage tasks. Instead of doing everything in one prompt as in CoT, the output of each stage is used as input for the subsequent stage, enabling iterative refinement.

Example:

1. Step 1: Summarize the article in three bullet points.
   Output:
   - Discusses climate change impacts.
   - Highlights potential solutions.
   - Mentions government policies.
2. Step 2: Expand on the second bullet point with examples.

Each step dynamically updates the latent context, progressively narrowing the model’s focus. Probabilistically, cascading breaks a complex task into simpler sub-tasks, reducing the risk of logical errors or omissions.
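
In code, a cascade is just function composition over prompts. The sketch below assumes a thin `generate` helper around the OpenAI Python client with a placeholder model name; any LLM backend would do.

```python
# Sketch of prompt cascading: each stage's output becomes the next stage's input.
# `generate` is a hypothetical helper; the model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

article = "(full article text here)"

# Stage 1: condense the article.
bullets = generate(f"Summarize the article in three bullet points:\n\n{article}")

# Stage 2: refine one part of stage 1's output.
expanded = generate("Expand on the second bullet point with concrete examples:\n\n" + bullets)
print(expanded)
```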

This is by no means a comprehensive list, and it omits one of the most important methods that caused a paradigm shift in utilizing LLMs: Retrieval Augmented Generation (RAG). I intend to cover it in a separate post.

After Scaling Laws

The advancements following GPT-3 represented not just incremental improvements but significant strides toward aligning large language models with human needs and preferences. Central to this evolution was the introduction of InstructGPT, a breakthrough that shifted the focus from raw generative capability to human alignment and usability.

InstructGPT

Released in early 2022, InstructGPT by OpenAI addressed one of GPT-3's most prominent challenges: its tendency to produce verbose, irrelevant, or unhelpful responses. While GPT-3 showcased remarkable generative capabilities, it lacked the nuance to consistently align its outputs with user intentions. InstructGPT popularized Reinforcement Learning from Human Feedback (RLHF), a methodology that revolutionized how LLMs could be trained to prioritize meaningful, contextually relevant responses.

The mechanism is basically as follows:

Human Feedback: A team of human labelers ranked multiple outputs generated by the model for a variety of prompts. These rankings provided the basis for training a reward model (a sketch of its objective follows this list).

Fine-Tuning via RLHF: The reward model guided the optimization process, aligning the LLM's outputs with human preferences.

Instruction Tuning: Datasets containing prompts and their desired outputs were curated to train the model to better understand and respond to user instructions.
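
Roughly, following the InstructGPT paper, the reward model is trained with a pairwise ranking objective over labeler preferences:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big]
$$

where y_w is the response the labeler preferred over y_l for prompt x. The language model is then fine-tuned (with PPO) to maximize this learned reward while being penalized for drifting too far from its original distribution.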

[Figure: the InstructGPT RLHF pipeline. Image by PopoDameron, own work, CC BY-SA 4.0]

This process resulted in a model capable of aligning its behavior more closely with human intent. Notably, human preferences became encoded in the model’s latent space, subtly guiding its responses toward greater relevance, safety, and usability. The model not only reduced harmful or biased outputs but also became more adept at interpreting nuanced instructions.

InstructGPT laid the groundwork for a significant leap forward in user-centric design, setting the stage for GPT-3.5 and ChatGPT.

Meanwhile, Google advanced the field with models like Flan-T5 and LaMDA, demonstrating how instruction-tuned architectures could achieve both alignment and versatility:

LaMDA (Language Model for Dialogue Applications) (2021): Google introduced this model to excel in open-domain dialogue, emphasizing fluid, conversational AI that could handle nuanced, context-aware interactions.

Flan-T5 (2022): Google's Flan-T5 ("Flan" stands for Fine-tuned LAnguage Net) applied large-scale instruction fine-tuning to the T5 architecture, markedly improving zero-shot and few-shot generalization, including on reasoning-heavy tasks.

GPT-3.5

Released by OpenAI later in 2022, GPT-3.5 built upon the advancements of InstructGPT, refining instruction-following capabilities and expanding the model's versatility. It incorporated larger datasets, iterative RLHF improvements, and advanced fine-tuning techniques.

As expected from scaling laws, this iteration demonstrated:

Improved Conversational Fluency: Responses became more natural and engaging, with noticeable improvements in coherence and tone.

Generalization to Unseen Tasks: GPT-3.5 extended the zero-shot and few-shot learning capabilities of GPT-3, showcasing remarkable adaptability to a wide array of user prompts.

Enhanced Reasoning: Multi-step reasoning tasks and problem-solving saw incremental gains, although limitations in logical rigor and factual accuracy persisted.

Around the same time, chain-of-thought (CoT) prompting emerged as another powerful method for eliciting intermediate reasoning steps. This technique proved invaluable in reasoning-intensive tasks like solving mathematical problems or generating logical arguments.

However, GPT-3.5 still faced challenges:

Context Retention in Multi-Turn Dialogues: Extended conversations sometimes led to "context drift," where earlier parts of the dialogue were misinterpreted or forgotten.

Dynamic Intent Adaptation: The model struggled to seamlessly adjust to evolving user needs within a single session.

Other notable models developed during this time include:

PaLM (Pathways Language Model) (Google, 2022): With its 540 billion parameters, PaLM set a new benchmark in scaling, demonstrating advanced reasoning, multi-modal capabilities, and instruction-following. Its innovations in few-shot prompting allowed it to achieve state-of-the-art performance across diverse tasks, from translation to creative writing. PaLM’s development highlighted Google’s emphasis on combining scale with usability.

OPT (Open Pre-trained Transformer) (Meta, 2022): Meta's OPT marked a significant step toward openness, releasing model weights (up to 175B parameters, available to researchers) together with code and detailed training logbooks, inviting the community to study large models directly.

Claude (Anthropic, 2023): Building directly on this wave of safety-centric innovation, Claude from Anthropic represented a deliberate shift toward aligning AI behavior with human values. Anthropic focused on embedding ethical guidelines and interpretability into its training process, emphasizing transparency and user safety. Claude excelled in conversational AI, offering responses that were not only accurate but also designed to avoid harm or controversy.

These limitations and insights collectively paved the way for the next step: ChatGPT, OpenAI’s first major release explicitly designed for dialogue, and other strong conversational models.

ChatGPT

ChatGPT, based on GPT-3.5, was OpenAI’s first major release explicitly designed for dialogue-based interactions. While GPT-3 and InstructGPT could handle prompts effectively, ChatGPT was optimized for multi-turn conversations, making it suitable for a broad range of interactive applications.

Conversational Context Management:
ChatGPT introduced advanced techniques for retaining and referencing context across multiple dialogue turns. This was achieved through:
- Dynamically updating internal conversation embeddings.
- Fine-tuning latent representations for dialogue coherence and continuity.

RLHF Refinements:
Building on the RLHF methodology, ChatGPT introduced:
- Evaluations not just for individual responses but for entire conversations.
- A reward model prioritizing conversational tone, relevance, and consistency across exchanges.

User-Focused Optimization:
ChatGPT’s responses felt more intuitive and conversational, making it accessible to non-technical users while retaining versatility for technical applications.

ChatGPT quickly became a versatile tool across domains and continues to be used every day by millions of users. The true breakthrough of ChatGPT was not just technical; it was experiential. It marked the transition of LLMs from experimental tools to widely adopted, user-centric products. By addressing usability, safety, and conversational adaptability, ChatGPT set a new standard for AI-human interaction.

Despite its transformative capabilities, ChatGPT faced notable challenges:

Reasoning Limitations: While improved, multi-step reasoning and logical problem-solving occasionally faltered.

Factual Inaccuracies: The model sometimes generated confident but incorrect answers, a challenge tied to its reliance on probabilistic patterns from training data.

Contextual Drift: Long dialogues could still lead to subtle inconsistencies or misunderstandings.

These limitations mirrored those in models like LaMDA, PaLM, and Claude, which also wrestled with context retention and factual grounding. The iterative improvement of ChatGPT influenced the development of GPT-4, Anthropic’s Claude, Meta’s LLaMA 2, and Google’s Gemini, which aimed to address these challenges while introducing multimodal and task-specific capabilities.

Anthropic’s Claude series, starting in 2023, prioritized safety and ethical alignment. These models integrated human-centric design principles directly into their training process, addressing many of the safety concerns raised by earlier conversational models:

Claude 1: Marked the beginning of Anthropic’s journey into conversational AI, focusing on alignment with human preferences.

Claude 2: Improved coherence, contextual understanding, and adaptability in dialogues, bridging the gap between ethical considerations and competitive performance.

Claude 3 (2024): Expanded on its predecessors with better multi-domain adaptability and enhanced reasoning capabilities, making it a strong competitor in long-term conversational tasks.

Claude’s focus on alignment positioned it as a trusted choice for industries requiring safety-critical applications, from education to healthcare.

OpenAI’s GPT-4, released in 2023, built on the success of ChatGPT and GPT-3.5 by introducing major advancements in scale and functionality.

Enhanced Context Length: GPT-4 significantly extended the context window, enabling reasoning over larger inputs, such as entire documents or long conversations, without losing coherence.

Multimodal Capabilities: For the first time, GPT-4 supported both text and image inputs, unlocking applications like document analysis, image captioning, and visual question answering.

Improved Reasoning and Alignment: GPT-4’s enriched latent space and refined training methodology allowed for better multi-step reasoning and alignment with user intentions, addressing many of the limitations observed in GPT-3.5.

Accessibility of LLMs

LLaMA: Democratizing Access to High-Performance NLP


In 2023, Meta AI introduced LLaMA (Large Language Model Meta AI), a groundbreaking initiative that provided open access to cutting-edge NLP models. Unlike proprietary models such as OpenAI’s GPT series, LLaMA was designed to empower researchers and practitioners through transparency, efficiency, and accessibility. Its open-access ethos spurred a wave of innovation, including tools and frameworks that enhanced its usability, such as Ollama and llama.cpp.

LLaMA was released in multiple sizes (7B, 13B, 30B, 65B), catering to diverse computational needs. Smaller models provided competitive performance, while larger variants excelled in complex reasoning tasks. Its architecture emphasized computational efficiency, achieving state-of-the-art results with fewer parameters compared to proprietary counterparts.

LLaMA was trained on high-quality, diverse datasets, ensuring strong generalization capabilities. This focus on data quality minimized biases and enhanced robustness across tasks.

Alpaca (2023, Stanford University) demonstrated the potential of small-scale fine-tuning with LLaMA. By generating synthetic examples and fine-tuning the original LLaMA, Alpaca was optimized for instruction-following tasks. This effort highlighted how accessible techniques could produce models comparable to larger, proprietary systems, emphasizing the versatility of LLaMA for targeted applications.

Vicuna (2023, by LMSYS) advanced conversational AI by fine-tuning LLaMA on tens of thousands of user-shared conversations collected from ShareGPT. Vicuna's emphasis on dialogue performance enabled it to rival commercial models like ChatGPT in many evaluations, showcasing the power of open-access models when adapted with real conversational data.

LLaMA 2 (2023, Meta) was Meta’s major upgrade to the original LLaMA. It introduced instruction tuning, improving usability and making the model more aligned with user expectations out of the box. LLaMA 2 maintained Meta’s open-access philosophy, solidifying its role as a foundational model for research and development in NLP.

LLaMA 3 (2024, Meta) further pushed the boundaries of the LLaMA series, emphasizing multimodal capabilities and fine-grained instruction-following. LLaMA 3 integrated advanced reasoning and improved context handling, making it competitive with proprietary systems in both text and multimodal tasks. It also introduced modular architecture, allowing for more efficient scaling and adaptability across diverse domains while continuing the commitment to open-access development.

LLaMA’s release also catalyzed a surge of derivative models and tools, including:

Ollama: A tool designed to simplify the deployment and interaction with LLaMA models, focusing on usability for a wide range of users.

llama.cpp: An efficient framework enabling LLaMA-based models to run on consumer-grade hardware, including laptops and smartphones. This drastically reduced hardware requirements, making high-performance NLP more accessible.

LLaMA’s release redefined the NLP landscape, proving that innovation thrives in open-access paradigms. By prioritizing efficiency, adaptability, and community-driven development, LLaMA and its ecosystem inspired a new wave of open-source NLP tools. This democratization has expanded the boundaries of what’s possible in language understanding and generation, fostering collaboration across academia and industry.

We now have open source large language models. But there is an elephant in the room: increasing the scale is not free.

Mixture of Experts (MoE) Models

As the field of large language models advanced, the demand for higher performance collided with the limits of scalability. Dense models like ChatGPT, while powerful, required all parameters to be active for every task, leading to exponential computational costs as the model size grew. The solution? Mixture of Experts (MoE) models—an innovative approach to achieving efficiency and specialization in large-scale neural networks.

Mixture of Experts models use sparse activation to dynamically allocate computational resources. Instead of engaging the entire model’s parameters for every input, only a subset of expert modules is activated based on the task or input. These experts are highly specialized sub-networks within the broader model, and a routing mechanism determines which experts are most relevant for the current task.
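
Here is a minimal sketch of that routing idea in PyTorch (a toy layer with arbitrary sizes and a top-2 router, not any production MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Sketch of a sparsely activated MoE layer: a router picks the top-k experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)           # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                      # x: (n_tokens, d_model)
        scores = self.router(x)                                # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)             # each token keeps its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out                                             # only k of n_experts ran per token

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)   # torch.Size([10, 64])
```

The compute cost per token scales with k, not with the total number of experts, which is the whole point.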

Developed by Google, the GShard framework was introduced in June 2020. It facilitated the scaling of multilingual neural machine translation models by incorporating sparsely-gated MoE layers, effectively managing models with over 600 billion parameters.

Building upon this foundation, the Switch Transformer was presented in January 2021. It simplified the MoE routing mechanism by assigning each input token to a single expert, enhancing training stability and computational efficiency. This innovation enabled the creation of models with up to a trillion parameters, demonstrating substantial improvements in pre-training speed and performance across various tasks.

In December 2023, Mistral AI released Mixtral 8x7B, an open-weight sparse MoE model with eight experts per MoE layer. Despite the name, the total parameter count is roughly 47 billion rather than 8 × 7 = 56 billion, since only the feed-forward blocks are replicated across experts; and because each token is routed to just two experts, only around 13 billion parameters are active per token during inference, resulting in faster processing times and reduced computational requirements. Notably, Mixtral outperformed models like Llama 2 70B and GPT-3.5 on several benchmarks, highlighting the efficacy of MoE architectures.

Despite their advantages, MoE architectures introduce new complexities:

Routing Inefficiencies: Determining the optimal routing strategy remains computationally intensive and can lead to imbalanced expert usage.

Overhead in Sparse Training: While inference is efficient, training MoE models requires sophisticated infrastructure and optimization techniques.

Complex Debugging: The modular nature of MoE models complicates debugging and performance evaluation.

MoE models give us a critical insight into the evolution of LLMs: scaling alone is not enough. There is a cycle between scaling and specialization.

LoRA: Unlocking Parameter Efficiency

Low-Rank Adaptation (LoRA) is a breakthrough in fine-tuning large language models, designed to significantly reduce the computational and memory overhead required for adapting these models to new tasks. Unlike traditional fine-tuning, which updates all model parameters, LoRA focuses on modifying a small subset of parameters, maintaining the rest of the pre-trained weights in a frozen state.

LoRA introduces additional low-rank matrices into the model, representing the weight updates in a compact and efficient manner. These matrices are added to the frozen pre-trained weights during inference. The key idea is to leverage the observation that many updates in neural networks lie in a low-dimensional subspace.

This allows LoRA to significantly reduce the number of trainable parameters, which drastically lowers the computational and memory requirements of fine-tuning.
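
A minimal sketch of the idea in PyTorch (a toy wrapper; the rank, scaling, and initialization choices here are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: freeze the pretrained weight W and learn a low-rank update B @ A.
    The effective weight becomes W + (alpha / r) * B @ A, with far fewer trainable parameters."""
    def __init__(self, linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad = False                                  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))   # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / total: {total}")   # only the low-rank factors are trained
```

For a 768x768 layer and rank 8, the trainable parameters shrink from roughly 590k to about 12k, which is why LoRA adapters fit comfortably on consumer GPUs.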

LoRA has gained widespread adoption for fine-tuning models like LLaMA and other open-access models. In real-world deployments, LoRA has enabled researchers to adapt state-of-the-art models for tasks such as sentiment analysis, domain-specific text generation, and machine translation, often on consumer-grade hardware.

Quantization: Precision Reduction

In standard neural networks, weights and activations are typically stored in 32-bit floating-point precision (FP32). Quantization reduces this precision to lower-bit formats (like 8-bit, 4-bit, or even 1-bit). This reduction leads to decreased memory usage and faster computation, facilitating the deployment of large language models on hardware with limited resources.
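
As a toy illustration, here is per-tensor symmetric 8-bit quantization (real schemes such as GPTQ or the GGUF k-quants are considerably more sophisticated, with per-group scales and error compensation):

```python
import numpy as np

def quantize_int8(w):
    """Toy symmetric 8-bit quantization: map float weights to int8 plus one scale per tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.randn(4096).astype(np.float32)            # pretend these are FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("memory:", weights.nbytes, "->", q.nbytes, "bytes")      # 4x smaller
print("max abs error:", np.abs(weights - restored).max())      # small, bounded rounding error
```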

Quantization has revolutionized how large language models are deployed, with tools like llama.cpp enabling efficient use of quantized models on consumer hardware. Hugging Face now supports .gguf models, a quantized format tailored for LLaMA and other open-source models. This format allows users to run high-performance models on laptops, desktops, and even smartphones, democratizing access to cutting-edge NLP capabilities.

Quantization has been pivotal for deploying models like LLaMA 2 and Mistral in production settings. These models, often trained with billions of parameters, achieve competitive performance in tasks like summarization and question answering while running efficiently on edge devices. By combining quantization with frameworks like LoRA, researchers and developers can now fine-tune and deploy state-of-the-art models with unprecedented efficiency.

Recent advancements have pushed the boundaries of quantization. Research initiatives, notably at Microsoft, have introduced 1-bit Large Language Models (LLMs) like BitNet b1.58, where each parameter is ternary (-1, 0, 1). These models match the performance of full-precision counterparts while offering significant reductions in latency, memory footprint, and energy consumption.

And to facilitate the deployment of 1-bit LLMs on local devices, Microsoft released BitNet.cpp, an inference framework that enables efficient execution of 1-bit models on standard CPUs without the need for specialized hardware. This development democratizes access to advanced language models, making them more accessible for various applications.

Quantization surely reduces the precision of weights, but critical latent space properties mostly remain intact. Semantically similar tokens still remain close.

Scaling Test-Time Compute

The research from the paper Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters by Google (2024) introduces a novel perspective on the trade-off between pretraining and test-time computation. Instead of solely expanding model parameters and pretraining, the study explores methods to maximize the efficiency of inference-time computation.

The compute-optimal test-time scaling approach adjusts the allocation of compute resources during inference based on the difficulty of the prompt. It integrates strategies like:

Sequential Revisions: Models iteratively refine their outputs, learning from their initial responses.

Verifier-Guided Search: Uses process reward models (PRMs) to evaluate intermediate and final steps, allowing tree-search or beam-search to optimize the solution space.

The findings highlight that test-time compute can often substitute for large-scale pretraining. For "easy" and "medium" tasks, test-time compute with a smaller model can outperform a 14x larger pretrained model. However, pretraining larger models remains essential for "hard" tasks.

This research shows the potential for smarter resource use at inference time, offering a compelling alternative to relentless scaling of model parameters. It highlights a paradigm shift in how we approach model training and deployment, paving the way for more sustainable and efficient AI systems.

Also, a recent blog post by Hugging Face examines search-based methods as a means to optimize test-time compute. These techniques involve generating multiple candidate responses during inference and selecting the most appropriate one, thereby improving the model's output quality. The blog post mentions three specific strategies:

Best-of-N Sampling: This method generates 'N' responses for a given input and evaluates each to select the best outcome. By increasing 'N', the likelihood of producing a high-quality response improves, albeit with a corresponding rise in computational cost. (A small sketch follows this list.)

Beam Search: Beam search maintains multiple hypotheses at each step of generation, expanding the most promising ones. This approach balances exploration and exploitation, aiming to find the most likely sequence of words for a given input.

Diverse Verifier Tree Search (DVTS): An extension of traditional tree search methods, DVTS introduces diversity in the search process by incorporating verifier models that assess the validity of different branches. This technique enhances the model's ability to explore varied solutions, leading to improved performance, especially in complex tasks.
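
Best-of-N is the simplest of the three to write down. The sketch below uses hypothetical `generate` and `score` helpers; in practice `generate` would sample from the LLM and `score` would be a verifier or process reward model.

```python
# Sketch of best-of-N sampling. `generate` and `score` are hypothetical helpers:
# `generate` samples one candidate answer, `score` plays the role of a verifier / reward model.
def best_of_n(prompt, generate, score, n=8):
    candidates = [generate(prompt) for _ in range(n)]        # sample N independent answers
    scored = [(score(prompt, c), c) for c in candidates]     # evaluate each with the verifier
    return max(scored)[1]                                     # keep the highest-scoring one

# Toy usage with stand-in functions (assumptions, not a real model or verifier):
import random
answers = ["42", "41", "I don't know"]
generate = lambda prompt: random.choice(answers)
score = lambda prompt, answer: 1.0 if answer == "42" else 0.0
print(best_of_n("What is 6 x 7?", generate, score))
```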

These methods leverage the structure of the latent space by exploring multiple trajectories for embeddings. This ensures the model can refine or discard less plausible outputs based on the task's context.

When solving a reasoning-intensive task, Diverse Verifier Tree Search (DVTS) ensures embeddings for potential solutions are optimized for diversity. For example, in a question-answering task, embeddings for "solar power efficiency" are explored alongside "wind energy efficiency", leading to a richer set of candidate answers.

To facilitate the application of search-based methods, Hugging Face has introduced a repository containing scripts and recipes for implementing different search algorithms at test-time.

Empirical studies have demonstrated that optimizing test-time compute can enable smaller models to outperform larger counterparts. For instance, a 3-billion parameter LLaMa model, when equipped with compute-optimal scaling strategies, surpassed the performance of a 70-billion parameter model on complex mathematical tasks. This finding underscores the efficacy of test-time compute optimization in achieving superior results without escalating model size.

Reasoning Models

o1

Released in 2024, OpenAI's o1 marked a transformative leap in how large language models approached logical reasoning tasks. While prior models like GPT-3.5 and GPT-4 excelled at generating fluent and coherent outputs, they often struggled with maintaining consistency and accuracy over multi-step reasoning processes. o1 addressed these limitations by being trained explicitly to reason step by step before answering.

o1 was trained on data curated specifically for reasoning tasks, such as mathematical problem-solving, programming, and scientific deduction. The training emphasized logical progression, ensuring that the model could generate intermediate steps for complex problems.

Building on the success of Chain-of-Thought (CoT) prompting, o1 made extended chains of thought part of the model itself, reportedly via large-scale reinforcement learning. This allowed the model to self-correct by revisiting earlier steps in its reasoning, minimizing errors in multi-step tasks.

o1 also proved better at updating its understanding of the context as it progressed through a problem. This was crucial for handling tasks requiring coherence over long sequences, such as scientific proofs or extended code debugging.

o1 set a new standard for reasoning-focused LLMs. By combining logical rigor with dynamic adaptability, it opened new avenues for AI applications in STEM fields and industries requiring high precision.

However, its iterative reasoning process demanded higher computational resources, which highlighted the ongoing trade-offs between capability and efficiency.

o3

Shortly after, o3 built upon the successes of o1, pushing the boundaries of reasoning capabilities through advanced scaling and multi-domain integration. o3 was designed not just for reasoning but for reasoning at scale, enabling it to tackle deeply complex problems that required interdisciplinary knowledge and long-term coherence.

With substantially more compute and training data spanning diverse scientific, mathematical, and technical domains, o3 achieved remarkable generalization. Its reasoning capabilities extended to intricate multi-modal tasks, combining text, images, and structured data.

o3 leveraged inference-time scaling techniques, such as adaptive compute allocation and context prioritization, to optimize computational efficiency during problem-solving. These innovations reduced the latency associated with its reasoning processes, making it more practical for real-world deployment.

o3 achieved a landmark result on the ARC-AGI benchmark (built on the Abstraction and Reasoning Corpus, the benchmark behind the ARC Prize), reaching scores previously considered out of reach for language models. This result underscored o3's ability to handle abstract reasoning tasks traditionally reserved for human experts, alongside strong performance in mathematics, coding, and scientific problem-solving.

o3 redefined the capabilities of reasoning models, setting a new benchmark for AI in handling complex, multi-dimensional problems. Its success illustrated the potential for AI to contribute meaningfully to domains traditionally reserved for human expertise, such as advanced scientific research and ethical policymaking.

However, as with o1, the computational demands of o3 emphasized the importance of ongoing innovation in efficiency and accessibility, ensuring that such powerful models can be broadly utilized.

Conclusion

At the start of this journey, I intended to cover the development of current LLMs in a single, brief blog post. But this topic certainly deserved a more rigorous exploration.

In the next two posts, I want to write about the utilization of LLMs and about multimodal models.

I always think that current LLMs are severely underutilized and that new methods are waiting to be explored. There is a cycle of scaling and specialization, and I think we have arrived at the specialization part of the cycle.