Foundations of the Future: The Papers That Built ABCs of LLMs



From sequence-to-sequence models to transformers that changed everything


Timeline of Innovation

| Date | Paper Title | Authors | Organization(s) | Link |
|---|---|---|---|---|
| 10 Sep 2014 | Sequence to Sequence Learning with Neural Networks | Ilya Sutskever, Oriol Vinyals, Quoc V. Le | Google | arXiv:1409.3215 |
| 12 Jun 2017 | Attention Is All You Need | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin | Google Brain, Univ. of Toronto | arXiv:1706.03762 |
| 16 Jun 2017 | One Model To Learn Them All | Łukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit | Google Brain, Univ. of Toronto | arXiv:1706.05137 |
| 18 Jul 2017 | Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML) | Chelsea Finn, Pieter Abbeel, Sergey Levine | UC Berkeley, OpenAI | arXiv:1703.03400 |
| 11 Jun 2018 | Improving Language Understanding by Generative Pre-Training (GPT-1) | Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever | OpenAI | Paper (OpenAI Blog) |
| 11 Oct 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | Google AI Language | arXiv:1810.04805 |

1. Sequence to Sequence Learning (2014): Where It All Began

The 2014 NeurIPS paper “Sequence to Sequence Learning with Neural Networks” introduced the original seq2seq framework, a groundbreaking idea that powered the first wave of neural machine translation systems. Using deep LSTMs, it showed that you could map one sequence (like an English sentence) directly into another (like its French translation) in an end-to-end fashion — no hand-engineered rules or phrase tables needed.

How Seq2Seq Works

Encoder–Decoder LSTM: A multilayer LSTM encoder reads the input sentence and compresses it into a single fixed-length vector. Then a decoder LSTM generates the output sentence, one word at a time, from that vector.

Variable-Length Inputs: Unlike older models that required fixed-size inputs, the encoder LSTM could handle sequences of any length, making it perfect for natural language.

Sentence Embeddings: The model learned dense representations of sentences that preserved word order and captured semantic similarity — sentences with similar meaning clustered together in embedding space.

Reversing Source Sentences: A clever trick that helped optimization: reversing the input word order made training easier by shortening dependency paths, improving gradients, and boosting performance.

Large-Scale Training: The team trained huge LSTMs (4 layers, 1000 units each, ~380M parameters) on millions of parallel sentences (WMT’14 English–French), carefully stabilizing training with gradient clipping and batching.
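To make the encoder–decoder pattern concrete, here is a minimal PyTorch sketch of the idea (not the paper's exact 4-layer, 1000-unit setup): the encoder compresses the reversed source sentence into its final hidden state, and the decoder generates target words from that state. All sizes and names below are illustrative placeholders.

```python
# Minimal seq2seq sketch: encoder LSTM -> fixed-size state -> decoder LSTM.
# Vocabulary sizes and dimensions are placeholders, not the paper's setup.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 8000, 8000, 256, 512

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src, tgt):
        # Reverse the source sentence, as in the paper, to shorten the
        # distance between early source words and early target words.
        src = torch.flip(src, dims=[1])
        _, state = self.encoder(self.src_emb(src))   # state = fixed-size summary
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)                     # next-word logits

model = Seq2Seq()
src = torch.randint(0, SRC_VOCAB, (4, 12))  # batch of 4 source sentences
tgt = torch.randint(0, TGT_VOCAB, (4, 10))  # shifted target sentences
logits = model(src, tgt)                    # shape: (4, 10, TGT_VOCAB)
```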

Challenges They Faced

Bottleneck of a Fixed Vector: Compressing an entire sentence into one vector often meant information loss, especially for long sequences. (This was later fixed with attention mechanisms.)

Training Long Dependencies: Deep LSTMs struggled with very long-range dependencies. The reversal trick helped, but training was still tough.

Vocabulary Limits: They used fixed vocabularies (80k–160k words), with unknown words mapped to an “UNK” token — hurting translation quality.

Data Hungry: Training required huge bilingual corpora, which not all languages have.

Decoding Complexity: Getting good translations required beam search decoding, which added compute and tuning overhead.
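For a feel of what that decoding step involves, here is a tiny, generic beam-search sketch. The `step_log_probs` function is a hypothetical stand-in for a trained decoder that scores possible next tokens given a prefix; swap in a real model to make it useful.

```python
# Generic beam search over a toy next-token scorer (not tied to any model).
import math

VOCAB = ["<eos>", "the", "cat", "sat"]

def step_log_probs(prefix):
    # Hypothetical stand-in: a uniform distribution over the toy vocabulary.
    return {tok: math.log(1.0 / len(VOCAB)) for tok in VOCAB}

def beam_search(beam_size=2, max_len=5):
    beams = [([], 0.0)]                          # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))   # finished hypothesis
                continue
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the top-scoring hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

print(beam_search())
```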

Why It Was a Big Deal

State-of-the-Art Results: Their model hit a BLEU score of 34.8 on WMT’14 English–French, beating strong phrase-based translation systems.

Surprisingly Good with Long Sentences: Thanks to the reversal trick, it translated long sentences without a drop in performance — something many doubted was possible.

Learned Rich Representations: Sentence embeddings captured word order and semantics without any hand-designed features.

Set the Stage for Attention and Transformers: This was the proof of concept that deep neural networks could handle sequence-to-sequence tasks, directly inspiring Bahdanau’s attention mechanism (2015) and eventually the Transformer (2017).

The Seq2Seq paper showed that LSTMs could translate entire sentences end-to-end with strong results, kicking off the neural machine translation era. Its encoder–decoder design and input reversal trick became the foundation for almost everything that came next in NLP.


2. Attention Is All You Need (2017): The Revolution

The 2017 paper “Attention Is All You Need” introduced the Transformer architecture, which completely reshaped NLP (and eventually deep learning as a whole). Instead of relying on recurrence (RNNs/LSTMs) or convolutions, the Transformer used only attention mechanisms to model sequences. This made it massively faster to train, more parallelizable, and ultimately far more powerful.

How the Transformer Works

Encoder–Decoder Design: Both the encoder and decoder are stacks of layers, each with self-attention + feedforward networks, plus residual connections and layer norm to keep training stable.

Multi-Head Attention: Instead of a single attention head, the model learns to “look” at different positions in parallel, capturing multiple types of relationships in the sequence.

Scaled Dot-Product Attention: The core math trick—dot products between queries and keys are scaled by the square root of their dimension, preventing unstable softmax gradients.

Positional Encoding: Since the model doesn’t have recurrence, it uses sine/cosine functions to encode token positions, letting it learn both relative and absolute ordering.

Position-wise Feedforward Networks: The same small two-layer MLP is applied independently at every position to transform its representation.

Weight Sharing: Input/output embeddings and the pre-softmax transformation share the same weights, keeping the model efficient.
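Two of those pieces, scaled dot-product attention and sinusoidal positional encodings, fit in a few lines. This is a compact sketch rather than the paper's full multi-head module; the shapes and sizes are illustrative.

```python
# Scaled dot-product attention and sinusoidal positions, in miniature.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ v

def sinusoidal_positions(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe

x = torch.randn(2, 10, 64) + sinusoidal_positions(10, 64)  # add positions
out = scaled_dot_product_attention(x, x, x)                # self-attention
print(out.shape)  # torch.Size([2, 10, 64])
```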

Why It Was a Big Deal

Faster Training: With no recurrence, the Transformer was fully parallelizable, cutting training time drastically compared to RNN-based models.

SOTA Performance: It set new records on WMT 2014 English–German (BLEU 28.4) and English–French (BLEU 41.8) translation, beating both RNNs and CNNs, even large ensembles.

Generalization Beyond Translation: It also worked well for tasks like parsing, showing it wasn’t just a translation trick.

Challenges They Dealt With

Long-Range Dependencies: RNNs struggled with long sequences, but attention solved this by letting every token directly connect to every other token.

Computational Cost: Self-attention has quadratic complexity in sequence length, which is a bottleneck for very long texts (still an active research problem today).

Overfitting Risks: They tackled this with dropout, label smoothing, checkpoint averaging, and carefully tuned learning schedules.

Position Encoding Choices: The sinusoidal encoding worked well and could extrapolate to longer sequences. Learned encodings didn’t offer much improvement in the original experiments.

Scaling Up: Larger models consistently performed better, but they required careful regularization and hyperparameter tuning.

Interpretability: Attention heads showed interesting patterns (like focusing on syntactic relationships), but truly “explaining” them remained tricky.

Why It Still Matters

The Transformer was more than just a better translation model — it was a paradigm shift. By removing recurrence and convolution, it created a scalable, general-purpose architecture that became the foundation for almost everything in modern NLP: BERT, GPT, T5, and beyond. Today, Transformers have spread far outside language, powering breakthroughs in vision, speech, protein folding, and multimodal AI.

The Transformer proved that attention alone was enough to outperform everything else in sequence modeling — and it lit the fuse for the entire modern deep learning era.


3. One Model To Learn Them All (2017): The Universal Dream

The MultiModel paper from Google was the first serious attempt at building a single deep learning model that could handle a bunch of very different tasks—things like ImageNet image classification, machine translation, COCO captioning, and even speech recognition—all under one umbrella. The crazy part? It pulled this off with only a little bit of task-specific code and by sharing most of the parameters across domains.

How MultiModel is Put Together

One Model, Many Domains: Instead of making a separate model for vision, language, and speech, MultiModel brought them all together. This was the first time a unified model was competitive across such different areas.

Modality Nets: Each input type (images, text, audio) goes through its own small sub-network, which transforms it into a shared representation. The same happens in reverse on the output side. This keeps the model flexible while avoiding bottlenecks.

Shared Subnets Across Tasks: Tasks from the same domain reuse the same modality-net—for example, all translation tasks share one text subnet. This not only reduces complexity but also makes it easier to add new tasks.

Core Blocks That Work Everywhere: At the heart of the model are building blocks that turned out to be useful across all tasks, including:

  • Depthwise-separable convolutions for local feature extraction.
  • Multi-head attention for context and relationships.
  • Sparsely-gated mixture-of-experts (MoE) for adding tons of capacity without blowing up compute costs.

These were originally designed for specific domains, but MultiModel showed they actually help across the board.
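As a rough illustration, here are toy PyTorch versions of two of those blocks: a depthwise-separable convolution and a tiny top-1 mixture-of-experts layer. These mirror the ideas rather than MultiModel's actual implementation, and every size below is a placeholder.

```python
# Toy versions of a depthwise-separable conv and a top-1 MoE layer.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, channels, out_channels, kernel_size=3):
        super().__init__()
        # Depthwise: one filter per input channel (groups=channels),
        # followed by a 1x1 pointwise convolution that mixes channels.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, out_channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class TinyMoE(nn.Module):
    def __init__(self, d_model, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model)
                                      for _ in range(num_experts)])

    def forward(self, x):                                # x: (batch, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)    # gating distribution
        idx = weights.argmax(dim=-1)                     # top-1 expert per row
        return torch.stack([self.experts[int(i)](row)
                            for row, i in zip(x, idx)])

print(DepthwiseSeparableConv(16, 32)(torch.randn(1, 16, 28, 28)).shape)
print(TinyMoE(64)(torch.randn(8, 64)).shape)
```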

Handling Speech

For speech input, the model could take raw 1D waveforms or 2D spectrograms. Instead of RNNs (which were the norm in speech at the time), MultiModel relied on CNN-based residual blocks—basically reusing its image processing backbone to handle spectrograms.

The Challenges

Performance Trade-offs: MultiModel didn’t quite match the absolute best specialized models on big datasets like ImageNet, though it came surprisingly close given it was juggling so many domains at once.

Complexity: The mix of convolutions, attention, MoEs, and residuals made it a beast to implement and train.

Hard to Explain: The model showed some unexpected cross-task transfer (like image training helping with parsing), but researchers didn’t fully understand why—its complexity made it tough to analyze.

Resource Heavy: Training across multiple large datasets demanded serious compute, which put it out of reach for smaller labs.

Why It Mattered

MultiModel was a proof of concept that general-purpose, multimodal AI wasn’t just a dream—it was actually possible. Even if it didn’t beat every specialist model, it sparked a wave of interest in unified architectures and laid the groundwork for today’s push toward generalist AI systems.


4. Model-Agnostic Meta-Learning (2017): Learning How to Learn

The Model-Agnostic Meta-Learning (MAML) paper introduces a really flexible framework for meta-learning that’s all about helping models adapt quickly with just a little bit of data. What makes it stand out is that it’s not tied to a specific model type—you can use it with pretty much anything trained by gradient descent: fully connected nets, CNNs, even reinforcement learning policies.

How MAML Works

Task Distribution: Instead of training on one fixed problem, MAML samples tasks from a distribution. Each task is its own mini learning problem—like classifying new image categories or solving a slightly different RL environment.

Inner Loop (Quick Adaptation): For each sampled task, the model does a handful of gradient updates on a small support set. This simulates “fast learning” as if the model just got exposed to a brand-new problem.

Outer Loop (Meta-Learning): After that, the model is tested on the query set for the same task, and those results are used to update the original parameters. The idea is to tune the starting point of the model so it can adapt really quickly to any new task.
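Here is a minimal sketch of that two-loop structure on toy sine-wave regression tasks (the same family of tasks as the paper's regression experiments). The tiny hand-rolled network and all hyperparameters are illustrative; the key lines are the inner gradient step taken with `create_graph=True` and the outer loss computed on the adapted "fast" parameters.

```python
# Minimal MAML sketch on toy sine-wave regression tasks.
import torch

def net(params, x):
    # A tiny 1-hidden-layer network written functionally, so it can be run
    # with either the meta-parameters or the task-adapted "fast" weights.
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

params = [torch.nn.Parameter(torch.randn(1, 40) * 0.1),
          torch.nn.Parameter(torch.zeros(40)),
          torch.nn.Parameter(torch.randn(40, 1) * 0.1),
          torch.nn.Parameter(torch.zeros(1))]
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01

def sample_task():
    # Each task is a sine wave with its own amplitude and phase.
    amp, phase = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * 3.14159
    def draw(n):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return draw

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                       # meta-batch of 4 tasks
        draw = sample_task()
        x_s, y_s = draw(10)                  # support set (inner loop)
        x_q, y_q = draw(10)                  # query set (outer loop)
        # Inner loop: one gradient step on the support set; create_graph
        # lets the meta-gradient flow back through this update.
        loss_s = ((net(params, x_s) - y_s) ** 2).mean()
        grads = torch.autograd.grad(loss_s, params, create_graph=True)
        fast = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop: evaluate the adapted parameters on the query set and
        # accumulate gradients into the shared initialization.
        ((net(fast, x_q) - y_q) ** 2).mean().backward()
    meta_opt.step()
```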

Why It’s a Big Deal

Few-Shot Learning: MAML nails the problem of learning from tiny amounts of data, setting strong benchmarks in few-shot regression and classification.

Reinforcement Learning: It’s not just for supervised tasks—MAML helps RL agents adapt to new environments with very little experience, like solving a maze they’ve never seen before.

Generalization: Models trained with MAML often adapt better than those trained with traditional fine-tuning or transfer learning.

The Catch

Computationally Heavy: Because it requires backpropagating through the adaptation steps, MAML can get expensive—especially with deeper networks or more gradient steps.

Sensitive to Task Distribution: It assumes tasks are somewhat related. If your task distribution is all over the place, the model’s ability to adapt drops off.

Gradient Complexity: The full version of MAML uses second-order derivatives, which adds to the compute load. First-order versions (like FOMAML or Reptile) cut costs but sometimes lose performance.

Hyperparameter Tuning: Learning rates and the number of inner/outer loop steps matter a lot, and you often have to tweak them carefully per domain.

Summary Table

| Component | Description | Challenge |
|---|---|---|
| Task Sampling | Draw tasks from a distribution for meta-training | Sensitivity to task heterogeneity |
| Inner Loop | Update the model on a task-specific support set | Computational cost (differentiating through updates) |
| Outer Loop | Update the shared initialization across tasks | Memory usage, second-order gradients |
| Applicability | Works for many architectures (CNNs, fully connected nets, RL policies) | Careful hyperparameter tuning required |

MAML is about fast adaptation — giving models a great starting point so they can learn new tasks almost instantly.

MultiModel is about broad generalization — proving that one architecture can juggle many domains at once, even if it’s not the absolute best at each.


5. GPT-1 (2018): The Generative Revolution Begins

The paper “Improving Language Understanding by Generative Pre-Training” (aka the first GPT paper) was a breakthrough in NLP. It introduced the idea that you could train one big, task-agnostic Transformer model on massive amounts of text, and then fine-tune it for specific tasks — achieving state-of-the-art performance across a wide variety of benchmarks.

How GPT Works

Two-Stage Training

Generative Pre-Training: First, the model is trained like a standard language model — predicting the next word on huge amounts of raw text. This step teaches it general grammar, semantics, and even a bit of world knowledge.

Discriminative Fine-Tuning: Then, the same model is fine-tuned on supervised tasks (like sentiment analysis, QA, or natural language inference). Sometimes an auxiliary language modeling loss is kept during fine-tuning to stabilize learning.
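In schematic form, the fine-tuning objective is just the supervised task loss plus a weighted language-modeling loss on the same inputs. The tensors below are random stand-ins for real model outputs, the vocabulary size is a placeholder, and the 0.5 weight follows the value used in the paper.

```python
# Schematic fine-tuning loss: task loss + weighted auxiliary LM loss.
import torch
import torch.nn.functional as F

lm_weight = 0.5                                  # auxiliary LM loss weight

# Random stand-ins for what the shared Transformer would output on a batch.
task_logits = torch.randn(8, 2)                  # e.g. a 2-class task head
task_labels = torch.randint(0, 2, (8,))
lm_logits = torch.randn(8, 64, 40000)            # next-token predictions
lm_targets = torch.randint(0, 40000, (8, 64))

task_loss = F.cross_entropy(task_logits, task_labels)
lm_loss = F.cross_entropy(lm_logits.reshape(-1, 40000), lm_targets.reshape(-1))
total_loss = task_loss + lm_weight * lm_loss     # what fine-tuning minimizes
```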

Model Architecture

GPT-1 is a 12-layer, decoder-only Transformer with masked self-attention, 768-dimensional hidden states, and 12 attention heads. It uses learned positional embeddings and byte-pair encoding (BPE) for a flexible vocabulary, with no task-specific heads or complex architectural add-ons.

Traversal-Style Input Formatting

Instead of designing separate models for different tasks, the inputs are serialized into simple token sequences with special delimiters. Examples:

  • NLI → [Premise] $ [Hypothesis]
  • QA → [Context] $ [Question] $ [Answer Option]
  • Classification → [Text Input]

This means the same model can handle tasks with single sentences, pairs, or triplets — just by changing the input string.
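A toy version of that serialization makes the point. The exact start, extract, and delimiter token strings below are made up for illustration; what matches the paper is the pattern of flattening every task into one token sequence.

```python
# Toy traversal-style formatting: every task becomes one delimited sequence.
START, EXTRACT, DELIM = "<s>", "<e>", "$"

def format_nli(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def format_qa(context, question, answer_option):
    return f"{START} {context} {DELIM} {question} {DELIM} {answer_option} {EXTRACT}"

def format_classification(text):
    return f"{START} {text} {EXTRACT}"

print(format_nli("A man is playing guitar.", "A person makes music."))
```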

What They Were Aiming For

Universal Language Model: Train one general model that learns useful representations for any NLP task.

Minimal Task Engineering: Instead of designing different architectures for different tasks (like siamese nets for similarity or cross-attention for QA), they wanted one simple, reusable framework.

Better Transfer: Use the abundance of unlabeled text to make up for the fact that labeled datasets are often tiny.

Challenges They Tackled

Lack of Labeled Data: Pre-training gave the model strong priors, so it performed much better even in low-data settings.

Which Objective Works Best? Before GPT, it wasn’t clear which unsupervised training methods transferred well. They made the case for language modeling as the most effective general-purpose objective.

Complex Model Design: Prior systems often needed custom architectures per task. GPT showed that a simple serialization trick + one shared model could work everywhere.

Proving Transfer Works: GPT beat state-of-the-art models (including ensembles) on 9 out of 12 tasks, spanning GLUE, RACE QA, commonsense reasoning, and sentiment.

Why It Mattered

The big contribution of this paper was showing that large-scale pre-training + lightweight fine-tuning could outperform hand-designed, task-specific systems. By unifying task inputs through serialization and keeping the architecture simple, it laid the foundation for today’s large-scale language models — GPT-2, GPT-3, and beyond.


6. BERT (2018): The Bidirectional Breakthrough

The 2018 paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” introduced BERT, a model that quickly became one of the most influential architectures in NLP. By pre-training a deep, bidirectional Transformer encoder and then fine-tuning it on specific tasks, BERT pushed performance on a wide range of benchmarks to entirely new levels.

How BERT Works

Bidirectional Pre-Training: Unlike earlier models (like GPT, which reads left-to-right) or ELMo (which stitches separately trained left-to-right and right-to-left models together), BERT jointly conditions on both left and right context at every layer. This is made possible by its Masked Language Model (MLM) objective.

Unified Model: The same pre-trained network can be fine-tuned with minimal changes for tasks like classification, QA, NLI, or sentence similarity — just add the right output head.

Model Sizes:

  • BERT-Base: 12 layers, 110M parameters.
  • BERT-Large: 24 layers, 340M parameters.

Core Components

Architecture: Multi-layer Transformer encoder (same base structure as the original Transformer’s encoder stack).

Input Representation: Sentences are packed into a single token sequence using WordPiece embeddings (30k vocab), with [CLS] at the start and [SEP] separating sentences.

Pre-Training Objectives:

  • Masked Language Model (MLM): Randomly mask 15% of tokens and predict them — forcing the model to use both left and right context.
  • Next Sentence Prediction (NSP): A binary task to predict whether one sentence follows another, teaching the model inter-sentence relationships.
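A small sketch of the MLM corruption step helps here: of the 15% of tokens selected for prediction, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged (the recipe that also eases the pre-train/fine-tune mismatch discussed below). The toy vocabulary is a placeholder.

```python
# Toy MLM masking with the 80/10/10 replacement recipe.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_rate=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_rate:
            inputs.append(tok)
            labels.append(None)                  # not predicted
            continue
        labels.append(tok)                       # model must recover this token
        r = random.random()
        if r < 0.8:
            inputs.append("[MASK]")              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs.append(random.choice(VOCAB))  # 10%: random token
        else:
            inputs.append(tok)                   # 10%: keep the original token
    return inputs, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
print(mask_tokens(tokens))
```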

Fine-Tuning: The entire model (not just embeddings) is fine-tuned on supervised tasks with minimal modifications, making it highly adaptable.

What Made It a Big Deal

State-of-the-Art Performance: BERT crushed benchmarks across the board — GLUE, SQuAD (v1.1 and v2.0), and SWAG — improving results by large margins.

Bidirectional Context: MLM enabled deep bidirectional understanding, which was a game-changer compared to earlier unidirectional models.

Versatility: Whether it was classification, QA, or sentence-pair tasks, the same model architecture worked with only tiny tweaks.

Challenges They Faced

Pre-Training vs Fine-Tuning Mismatch: The [MASK] token used during pre-training never appears during fine-tuning. To mitigate this mismatch, some masked tokens were replaced with random or unchanged tokens.

Fine-Tuning Instability: Larger models (like BERT-Large) often became unstable on small datasets — requiring multiple random restarts and careful hyperparameter tuning.

Compute Hungry: Training BERT, especially the large version, demanded massive compute resources and long training times.

Feature-Based vs Fine-Tuning Trade-Offs: While BERT is strongest when fine-tuned end-to-end, some tasks (e.g., with limited data or special architectures) still had to rely on using BERT as a frozen feature extractor, which underutilized its full potential.

Why It Still Matters

BERT wasn’t just an incremental improvement — it shifted the entire paradigm of NLP. It showed that large-scale bidirectional pre-training, paired with flexible fine-tuning, could outperform specialized architectures across nearly every task. From there, it set the stage for successors like RoBERTa, ALBERT, DistilBERT, and eventually the massive GPT-style generative models.

BERT gave NLP its “ImageNet moment.” By combining deep bidirectional Transformers with clever pre-training objectives, it redefined what was possible in language understanding — though at the cost of huge compute demands.