Foundations of the Future: Trying to Get Behind the Research Papers That Changed the Game
CNNs, RNNs, and LSTMs

| Paper | Key Highlights | Year |
|---|---|---|
| Learning Representations by Backpropagation | • Introduced the backpropagation learning procedure for training multilayer neural networks, enabling them to learn internal representations and features essential for solving complex tasks unattainable by simple perceptrons. • Introduced momentum in gradient descent for acceleration and weight-decay regularization to improve generalization. • Demonstrated that multi-layer networks can learn complex nonlinear mappings. | 1985 |
| Backpropagation Applied to Handwritten Zip Code Recognition | • Developed the convolutional neural network (CNN) architecture for image recognition, applying backpropagation to recognize handwritten ZIP codes from raw pixel input. • Showed that CNNs can automatically learn feature hierarchies (edges, shapes, etc.) directly from data, without manual feature engineering. • Highlighted challenges in generalizing to out-of-distribution data and in handling varied handwriting styles and noise. | 1989 |
| Learning Long-Term Dependencies with Gradient Descent is Difficult | • Analyzed why standard gradient-descent methods (such as backpropagation through time) struggle to learn long-term dependencies in recurrent neural networks. • Identified the vanishing and exploding gradient problem, where error signals decay exponentially or grow uncontrollably as they are propagated back through time. • Argued, with both theory and experiments, that tasks requiring information to be retained over long intervals are essentially out of reach for standard RNN training (a toy demonstration follows this table). | 1994 |
| Long Short-Term Memory (LSTM) | • Analyzed the vanishing/exploding gradient problem in RNNs. • Proposed the LSTM architecture, built around gated memory cells (a minimal cell sketch follows this table). • Demonstrated experimentally that LSTM can bridge minimal time lags of hundreds or even thousands of steps where previous methods fail. • Showed favorable complexity: O(1) per time step and weight, comparable to BPTT, and local in space and time. | 1997 |
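
To make the 1994 result more concrete, here is a toy NumPy sketch of my own (not code from the paper): it unrolls a small tanh recurrence and chains the per-step Jacobians, and the norm of the accumulated gradient shrinks geometrically with the number of steps. The sizes, scales, and variable names are all illustrative choices.

```python
import numpy as np

# Toy illustration of the vanishing-gradient effect analyzed in the 1994 paper.
# All sizes, scales, and names here are illustrative choices, not the paper's.
rng = np.random.default_rng(0)
hidden = 16
W = rng.normal(size=(hidden, hidden))
W *= 0.9 / np.linalg.norm(W, 2)        # rescale so the spectral norm is 0.9 (< 1)

h = rng.normal(size=hidden)            # initial hidden state
jac = np.eye(hidden)                   # accumulated Jacobian d h_t / d h_0

for step in range(1, 51):
    h = np.tanh(W @ h)                 # simple RNN update h_t = tanh(W h_{t-1})
    local = np.diag(1.0 - h ** 2) @ W  # per-step Jacobian d h_t / d h_{t-1}
    jac = local @ jac                  # chain rule across time steps
    if step % 10 == 0:
        print(f"step {step:2d}: ||d h_t / d h_0|| = {np.linalg.norm(jac):.2e}")
```

With a recurrent matrix whose spectral norm is above 1, the same product can explode instead, which is the other half of the problem the paper describes.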
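
The LSTM row is the answer to that story. Here is a minimal sketch of a single LSTM step in plain NumPy; note that it uses the now-standard variant with a forget gate, which was added after the original 1997 paper, and that the weight shapes, initialization, and names are my own illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step using the now-standard variant with a forget gate
    (added after the original 1997 paper). Weight names and shapes are
    illustrative, not the paper's notation."""
    W, U, b = params                  # dicts keyed by gate: 'i', 'f', 'o', 'g'
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell update
    c = f * c_prev + i * g            # additive cell-state path
    h = o * np.tanh(c)                # hidden state seen by the rest of the network
    return h, c

# Tiny usage example with random parameters (illustrative only).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
gates = ['i', 'f', 'o', 'g']
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in gates}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in gates}
b = {k: np.zeros(n_hid) for k in gates}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                    # run a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, (W, U, b))
print(h.shape, c.shape)               # (8,) (8,)
```

The additive update of the cell state `c`, gated by `f` and `i`, is what lets error signals survive across long time lags, in contrast to the repeated matrix products in the toy recurrence above.
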
MAML, Seq2Seq, Transformers, BERT & GPT
| Paper | Core Goal | Key Idea / Architecture | Strengths | Limitations | Legacy / Impact |
|---|---|---|---|---|---|
| Seq2Seq (2014) Sutskever, Vinyals, Le | End-to-end sequence mapping (e.g., machine translation). | Encoder–decoder LSTMs, with a “reverse the source sentence” trick to ease optimization. | First strong neural MT system; robust to long sentences; learned semantic sentence embeddings. | Fixed-size vector bottleneck; vocabulary limits (UNK tokens); data-hungry; difficult optimization. | Sparked the neural MT revolution; laid the groundwork for attention and Transformers. |
| Transformer (2017) Vaswani et al., “Attention Is All You Need” | Faster, more scalable sequence modeling. | Pure self-attention (no recurrence or convolution), multi-head attention, positional encodings (see the attention sketch after this table). | Parallelizable; better long-range modeling; SOTA in MT. | Quadratic self-attention cost; still compute-heavy for long sequences. | Paradigm shift; foundation of modern NLP, vision, and multimodal models. |
| GPT (2018) Radford et al., “Improving Language Understanding by Generative Pre-Training” | Universal transfer for NLP tasks. | Two-stage: pre-train on a unidirectional language-modeling objective → fine-tune via traversal-style input transformations that serialize each task. | Strong transfer; simple task-agnostic input format; SOTA on many NLP tasks. | Unidirectional context; risk of mismatch between pre-training and downstream task formats; small scale compared with its successors. | Pioneered the pre-train → fine-tune pipeline; ancestor of GPT-2/3/4. |
| BERT (2018) Devlin et al. | Deep bidirectional pre-training for language understanding. | Transformer encoder pre-trained with masked language modeling (MLM) and next-sentence prediction (NSP); fine-tuned for downstream tasks (see the masking sketch after this table). | Large SOTA gains on GLUE, SQuAD, etc.; bidirectional context; versatile with minimal task-specific changes. | [MASK] tokens appear only in pre-training, not fine-tuning; fine-tuning instability; heavy compute cost. | “ImageNet moment” for NLP; inspired RoBERTa, ALBERT, DistilBERT, and more. |
| MAML (2017) Finn et al. | Meta-learning for fast adaptation to new tasks. | Inner loop (task adaptation) + outer loop (meta-optimization) to learn a good initialization (see the inner/outer-loop sketch after this table). | Few-shot learning success; general framework (works for vision, RL, etc.). | Expensive (second-order gradients); sensitive to task distribution and hyperparameters. | Landmark in meta-learning; inspired FOMAML, Reptile, ANIL, etc. |
| MultiModel (2017) Kaiser et al. | Unified multitask, multimodal learning. | Shared Transformer-style core + modality-specific subnets for text, vision, speech. | First real multimodal generalist; showed surprising cross-domain transfer. | Lagged behind top specialist models; very resource heavy; complex to train. | Early blueprint for generalist AI; influenced multimodal transformers (PaLM-E, Flamingo, GPT-4V). |
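
The core operation behind the Transformer row above is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal single-head NumPy sketch with no masking and arbitrary illustrative dimensions; a real implementation adds multiple heads, masking, and per-layer learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017).
    Single head, no masking; Q and K have shape (seq_len, d_k), V has (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of value vectors

# Illustrative usage: 5 tokens, model width 16 (all dimensions are arbitrary here).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # token representations
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                     # (5, 16)
```

The (seq_len × seq_len) `scores` matrix is also where the quadratic cost noted in the table comes from.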
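
For BERT, the distinctive ingredient is how masked-language-model inputs are built. The sketch below follows the 80/10/10 corruption rule described in the paper; the helper name `mask_tokens`, the specific token ids, and the -100 "ignore" label are conventions I have assumed for illustration, not details from the paper.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """BERT-style masked-LM corruption: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Helper name, ids, and the -100 'ignore' label are illustrative conventions."""
    if rng is None:
        rng = np.random.default_rng()
    ids = np.array(token_ids)
    labels = np.full_like(ids, -100)              # unselected positions are not scored
    selected = rng.random(len(ids)) < mask_prob   # choose positions to predict
    labels[selected] = ids[selected]              # the model must recover these ids
    decide = rng.random(len(ids))
    ids[selected & (decide < 0.8)] = mask_id                   # 80% -> [MASK]
    random_pos = selected & (decide >= 0.8) & (decide < 0.9)   # 10% -> random token
    ids[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    return ids, labels                            # remaining 10% are left as they were

# Illustrative usage (mask_id=103 and vocab_size=30522 mirror bert-base-uncased).
corrupted, targets = mask_tokens([12, 7, 93, 5, 44, 61, 8, 20],
                                 mask_id=103, vocab_size=30522,
                                 rng=np.random.default_rng(0))
print(corrupted)
print(targets)
```

At fine-tuning time no [MASK] tokens appear in the input, which is exactly the pre-train/fine-tune mismatch noted in the table.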
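
Finally, MAML's two-level optimization is easiest to see on a toy problem. The sketch below meta-learns the initialization of a single-parameter linear model over a family of 1-D regression tasks; to keep it short it uses a first-order approximation (in the spirit of FOMAML) rather than the full second-order MAML update, and every constant in it is an arbitrary choice.

```python
import numpy as np

# First-order MAML-style sketch (in the spirit of FOMAML, not the full
# second-order update) on a toy family of 1-D regression tasks y = slope * x.
# The task family, model, step sizes, and loop counts are all arbitrary choices.
rng = np.random.default_rng(0)
inner_lr, outer_lr = 0.05, 0.01
w = 0.0                                  # meta-learned initialization of the lone parameter

def loss_grad(w, x, y):
    """Gradient of the mean squared error of the linear model y_hat = w * x."""
    return np.mean(2.0 * (w * x - y) * x)

for meta_step in range(2000):
    meta_grad = 0.0
    for _ in range(4):                               # small batch of sampled tasks
        slope = rng.uniform(0.5, 2.5)                # each task is y = slope * x
        x_support, x_query = rng.normal(size=10), rng.normal(size=10)
        y_support, y_query = slope * x_support, slope * x_query
        w_task = w - inner_lr * loss_grad(w, x_support, y_support)   # inner loop: adapt
        meta_grad += loss_grad(w_task, x_query, y_query)             # outer loop: query loss
    w -= outer_lr * meta_grad / 4                    # update the shared initialization

print(f"meta-learned initialization w = {w:.3f}")    # ends up near the mean task slope
```

The full MAML update would also differentiate through the inner gradient step, which is the source of the second-order cost noted in the table.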
