Large Language Models, Myths and Realities

Everything you wanted to know about Deep Learning and Large Language Models but were afraid to ask

Tanguy Lefort

Université de Montpellier, IMAG

CNRS, Inria, LIRMM

François-David Collin

CNRS

IMAG

2025-10-06

Generalities on Machine Learning and Artificial Intelligence

Historical perspective on AI

Until the 1980s, AI was all about symbolic reasoning, culminating in “Expert Systems”, which were eventually abandoned:

  • 🧱 based on a fixed set of rules that is very hard to improve
  • 💥 fails miserably on edge cases and outlier data
  • 🧑‍🔬 relies on a handful of domain experts, and overall reliability is very hard to assess

\Rightarrow 🧠 an alternative neuro-computing branch of AI, but…

… by the mid-’80s, both seemed to have failed to deliver on their promises. 🤷‍♂️

Historical perspective timeline

AI domains

Artificial Intelligence (AI)

Machine Learning (ML)

Deep Learning (DL)

Large Language Models (LLM)

Train a neural network

  • Class of prediction functions f_\theta: linear, quadratic, trees
  • Loss \mathcal{L}: L^2 norm, CrossEntropy, purity score
  • Optimizer: SGD, Adam, …
    • learning rate \eta: \theta_{k+1} \gets \theta_k - \eta \nabla_\theta \mathcal{L}
    • other hyperparameters
  • Dataset:
    • training: \{(x_i, y_i)\}_{i}: compute the loss between the prediction f_{\theta}(x_i) and the label y_i to update \theta
    • test: only compute performance scores (no more updates!)
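A minimal sketch of these ingredients on a toy linear model (the data, sizes and learning rate are arbitrary assumptions):

```python
import numpy as np

# Hypothetical toy regression data (any real dataset would do)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
X_test, y_test = rng.normal(size=(20, 3)), rng.normal(size=20)

theta = np.zeros(3)   # parameters of a linear prediction function f_theta(x) = x . theta
eta = 0.1             # learning rate

for epoch in range(50):
    pred = X_train @ theta                                   # f_theta(x_i) on the training set
    grad = 2 * X_train.T @ (pred - y_train) / len(y_train)   # gradient of the squared (L2) loss
    theta = theta - eta * grad                               # theta_{k+1} <- theta_k - eta * grad

# Test set: performance score only, no further updates
print("test MSE:", np.mean((X_test @ theta - y_test) ** 2))
```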

A quick survey of Deep Learning

Foreword, beware the Alchemy

  • 📜 Theoretical guarantees are more or less strong, depending on:

    • 🧑‍🔬 the field of research
    • 🧠 the type of network
    • 🔬 and there remains a gap from theory to applications
  • 🛠️ Myriad of ad-hoc choices, engineering tricks and empirical observations

  • 🚦 Current choices are critical for success: what are their pros and cons?

  • 🔄 Try ➔ ❌ Fail ➔ 🔁 Try again is the current pipeline

Science and/or Alchemy?

2017 NIPS, Ali Rahimi (“Test of Time” award talk)
  • 🤔 criticized the current state of Deep Learning research,
  • ⚠️ and the lack of scientific rigor in the field.

Vigorous response from Y. LeCun and his “followers”
  • arguing that mathematical rigor is not critical in Deep Learning research 🤷‍♂️
  • the field is doing just fine without it. 🚀

Criticizing an entire community (and an incredibly successful one at that) for practicing “alchemy” 🧪, simply because our current theoretical tools haven’t caught up with our practice is dangerous. Why dangerous? It’s exactly this kind of attitude that led the ML community to abandon neural nets for over 10 years, despite ample empirical evidence that they worked very well in many situations. (Yann LeCun, 2017, My take on Ali Rahimi’s “Test of Time” award talk at NIPS.)

The main ingredients

  • 🧮 Tensor algebra (linear algebra)
  • 🔁 Automatic differentiation
  • 🏃‍♂️ (Stochastic) Gradient descent
  • 🛠️ Optimizers
  • ⚡ Non-linearities
  • 📦 Large datasets

Also, on hardware side:

  • 🖥️ GPU
  • 🌐 Distributed computing

shape=(batch, height, width, features)

Tensor algebra

  • Linear algebra operations on tensors
  • MultiLayerPerceptron = sequence of linear operations and non-linear activations

f(x)=\phi_{L}\!\Big(W_{L}\,\phi_{L-1}\big(W_{L-1}\,\cdots \phi_{1}(W_{1}x+b_{1})\cdots + b_{L-1}\big)+ b_{L}\Big)

\Rightarrow input can be anything: images, videos, text, sound, …

x = \mathrm{vec}\!\Big( \underbrace{T_{\text{img}}}_{\in \mathbb{R}^{H\times W\times C}} \;\Vert\; \underbrace{T_{\text{text}}}_{\in \mathbb{R}^{L\times d_w}} \;\Vert\; \underbrace{T_{\text{audio}}}_{\in \mathbb{R}^{T\times d_a}} \;\Vert\; \underbrace{T_{\text{video}}}_{\in \mathbb{R}^{F\times H'\times W'\times C'}} \Big)
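A minimal NumPy sketch of the MLP formula above (the layer widths and the ReLU non-linearity are arbitrary assumptions):

```python
import numpy as np

def relu(z):
    # one common choice of non-linearity phi
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]   # assumed layer widths, e.g. a flattened 28x28 image as input
Ws = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def mlp(x):
    # f(x) = phi_L(W_L phi_{L-1}(... phi_1(W_1 x + b_1) ...) + b_L)
    for W, b in zip(Ws, bs):
        x = relu(W @ x + b)
    return x

print(mlp(rng.normal(size=784)).shape)   # (10,)
```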

Automatic differentiation

  • 🔗 Chain rule to compute gradient with respect to \theta
  • 🗝️ Key tool: backpropagation
    • 🧠 Don’t need to store the computation graph entirely
    • ⚡ Gradient is fast to compute (a single pass)
    • 🧮 But memory intensive

\nabla f(x), \quad \text{where } f(x)=\frac{x_{1}x_{2}\,\sin(x_3) + e^{x_{1}x_{2}}}{x_3}

\begin{array}{rcl} x_4 & = & x_{1}x_{2}, \\ x_5 & = & \sin(x_3), \\ x_6 & = & e^{x_4}, \\ x_7 & = & x_{4}x_{5}, \\ x_8 & = & x_{6}+x_7, \\ x_9 & = & x_{8}/x_3. \end{array}
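A sketch of the same computation with a reverse-mode autodiff library (PyTorch assumed available):

```python
import torch

x1, x2, x3 = (torch.tensor(v, requires_grad=True) for v in (1.0, 2.0, 3.0))

# f(x) = (x1*x2*sin(x3) + exp(x1*x2)) / x3, recorded as the intermediate values x4..x9
f = (x1 * x2 * torch.sin(x3) + torch.exp(x1 * x2)) / x3

f.backward()                      # single backward pass through the recorded graph
print(x1.grad, x2.grad, x3.grad)  # the three components of the gradient
```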

Gradient descent

Example with a non-convex function f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2

Sensitivity to initial point and step size
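A NumPy sketch of plain gradient descent on this function, illustrating the sensitivity to the starting point and the step size (the values used are arbitrary):

```python
import numpy as np

def grad_f(x1, x2):
    # gradient of f(x1, x2) = (x1^2 + x2 - 11)^2 + (x1 + x2^2 - 7)^2
    df1 = 4 * x1 * (x1**2 + x2 - 11) + 2 * (x1 + x2**2 - 7)
    df2 = 2 * (x1**2 + x2 - 11) + 4 * x2 * (x1 + x2**2 - 7)
    return np.array([df1, df2])

def descend(x0, eta, steps=500):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(*x)
    return x

# Different starting points reach different local minima; too large a step may diverge
print(descend([0.0, 0.0], eta=0.01))
print(descend([-4.0, 4.0], eta=0.01))
print(descend([0.0, 0.0], eta=0.1))
```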

(Stochastic) Gradient descent

  • 🚫 Do not use all the data at once to compute the gradient
    • 🧠 not feasible in practice (memory-wise)
  • 🗂️ Use mini-batches of data (small random subsets); see the sketch below
    • ⚙️ one more hyperparameter…

\theta_{k+1} \leftarrow \theta_k - \frac{\eta}{|\text{batch}|}\sum_{i\in\text{batch}}\nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)

\Rightarrow 🚫 No general guarantees of convergence in DL setting
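Continuing the earlier linear-model sketch, the same update computed on a random mini-batch instead of the full dataset (the batch size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 3)), rng.normal(size=10_000)
theta, eta, batch_size = np.zeros(3), 0.1, 32

for step in range(1_000):
    idx = rng.choice(len(y), size=batch_size, replace=False)   # draw a random mini-batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ theta - yb) / batch_size           # gradient estimated on the batch only
    theta = theta - eta * grad                                 # same SGD update rule as above
```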

Optimizers

SGD, Adam, RMSProp

  • 🔬 Non-convex optimization: research is still very active, and there is no clear consensus on the best optimizer for a given situation.
  • ❌ No guarantee of global minimum, only local minimum
  • ⚠️ No guarantee of convergence, only convergence in probability

(More than) a pinch of non-linearities

  • 🔀 Linear Transformations + ⚡ Non-linear activation functions
  • 🚀 Radically enhance the expressive power of the model
  • 🧭 Ability to explore the space of functions in gradient descent.

Train a Large Language Model (LLM)

From text to numbers

  • 🧮 Main problem: we can’t multiply or do convolutions with words
  • 📚 Second problem: many words (for a single language)
  • 🧠 Third problem: how to capture semantics?

Embeddings

  • Distance between words should not be character based

Figure: the words “women”, “woman”, “window”, “widow” placed in an embedding space (Tanguy Lefort, 2023).

Embeddings

  • Distance between words should not be character based

Figure: the words “women”, “woman”, “window”, “widow” placed in an embedding space, alternative layout (Tanguy Lefort, 2023).

Multi-scale learning from text

  • 🏗️ DL layers = capture different levels of dependencies in the data
  • 👀 attention mechanism applies “multi-scale learning” to data sequences \Rightarrow 🧩 e.g. not only words in sentences, but sentences in paragraphs, paragraphs in documents and so on.

\Rightarrow 🤖 transformers capture dependencies in the “whole” 🧩

Hierarchy from tokens to corpus with transformer layers enabling multi-scale attention.

Multi-facets learning from text

🧠 The attention mechanism extends to multifaceted dependencies of the same text components.

In the sentence:

the cat sat on the rug, and after a few hours, it moved to the mat.

\Rightarrow All those groups of words/tokens are multiple facets of the same text and its meaning. 🌈🔍

Multi-head attention heads (colored) capturing different semantic facets: coreference, spatial, temporal, actor continuity, duration, syntactic.

Transformers

Vaswani et al. (2017)

Heart of Transformers: Attention mechanism

\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • Three matrices: Query, Key, Value, derived from the input sequence
  • d_k: dimension of the key vectors, typically 64 or 128
  • weighted sum of the values V, with weights given by the compatibility between the query and the keys
  • softmax to turn the compatibility scores into a probability distribution
  • multi-head attention: several attention mechanisms in parallel

Vaswani et al. (2017)
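A single-head NumPy sketch of the attention formula above (the sequence length and dimensions are toy assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compatibility between queries and keys
    weights = softmax(scores, axis=-1)   # one probability distribution per query
    return weights @ V                   # weighted sum of the values

rng = np.random.default_rng(0)
L, d_k, d_v = 5, 64, 64                  # toy sequence length and dimensions
Q, K, V = rng.normal(size=(L, d_k)), rng.normal(size=(L, d_k)), rng.normal(size=(L, d_v))
print(attention(Q, K, V).shape)          # (5, 64)
```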

Head view of attention

Figure 1: The model view visualizes attention across all heads in a single Transformer layer.
  • Each line shows the attention from one token (left) to another (right).
  • Line weight reflects the attention value (ranging from 0 to 1).
  • Line color identifies the attention head.

BERT: Bidirectional Encoder Representations from Transformers

  • embeddings: represent words as vectors in a high-dimensional space
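A sketch of extracting contextual embeddings, assuming the Hugging Face transformers package and a downloadable bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the rug.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, each of dimension 768 for this model
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)
```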

Checking and loading models

Demos run entirely in the browser via WebGPU (Chrome, Edge, Firefox).

See Browser support for WebGPU for more details.

  • Total download size: ~1.2GB (may take several minutes).
  • Embedding model: typical size for an encoder.
  • Decoder model: small for a generative model; limited text quality and more hallucinations due to less knowledge capacity.
  • For comparison: GPT-5 is ~1.7TB (about 1000× larger).

BERT Demo

GPT : Generative Pre-trained Transformer

  • autoregressive model
  • generates text by predicting the next token
  • pre-trained on large corpora of text
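A sketch of autoregressive next-token generation, assuming the transformers package and the small gpt2 checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
# Each new token is predicted from all previously seen and generated ones
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(out[0]))
```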

GPT Demo - Next Token Prediction

Warning

The model is in its base form, without chat template.

BERT vs GPT

BERT

GPT

Summary of LLM types

| Type | Architecture | Training Objective | Attention | Use Cases |
|------|--------------|--------------------|-----------|-----------|
| BERT | Encoder stack only | Masked Language Modeling (MLM) | Bidirectional | Classification, QA, NER, sentiment analysis |
| GPT | Decoder stack only | Next-token prediction | Unidirectional (left-to-right, autoregressive) | Text generation, chatbots, open-ended tasks |
| Seq2Seq | Encoder + Decoder stacks | Sequence-to-sequence | Encoder: bidirectional; Decoder: unidirectional | Translation, summarization, speech, data-to-text |

Summary of LLM types (2)

| Type | Strengths | Weaknesses | Example Models | Training Data | Inference Speed |
|------|-----------|------------|----------------|---------------|-----------------|
| BERT (Encoder-Only) | Deep understanding of input; discriminative tasks | Not designed for generation | BERT, RoBERTa, DistilBERT | Large corpus (masked tokens) | Fast (parallelizable) |
| GPT (Decoder-Only) | Coherent, fluent generation | No bidirectional context | GPT-3, GPT-4, Llama | Large corpus (autoregressive) | Slower (autoregressive) |
| Seq2Seq (Encoder-Decoder) | Sequence transformation | Requires aligned input-output pairs | T5, BART, Transformer (original), Whisper | Parallel corpora (input-output pairs) | Moderate (depends on sequence length) |

Generative LLMs, Base vs Instruct

  • 🧠 Base models: just predict the next word (pre-training phase, no task-specific fine-tuning)
  • 📝 Instruct models: fine-tuned on specific tasks and follow user instructions more effectively

Important

Never use the base model for specific tasks without fine-tuning.
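As an illustration of the difference, instruct models expect their input wrapped in a chat template; a sketch assuming the transformers package and a small instruct checkpoint:

```python
from transformers import AutoTokenizer

# Hypothetical choice of a small instruct model
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
messages = [{"role": "user", "content": "Summarize the attention mechanism."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # the underlying next-token predictor only ever sees this formatted string
```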

Generative LLMs, Reasoning vs non-Reasoning

  • 🤖 (Non-reasoning) models focus on generating coherent text without explicit reasoning capabilities
  • Reasoning models:
    • 🧩 complex reasoning tasks
    • 🔗 multi-step problems
    \Rightarrow 💸 increased computational requirements (and budget)

Tip

The addition of reasoning to LLMs has been a breakthrough in the field since the end of 2024.

Reasoning Demo - Compare Outputs

🧩 With Reasoning (step-by-step)

🔍 Without Reasoning (direct answer)

The importance of the context window

  • 🪟 The context window is crucial for understanding and generating text.
  • 🧠 It determines how much information the model can consider at once.
  • 📏 Larger context windows allow for better understanding of complex queries and generation of more coherent responses.
  • 🔢 Typical max context windows are ~16k tokens; latest open-weight local LLMs reach 128k/256k/512k tokens; frontier LLMs are 1M+ tokens.
  • 💸 Long context windows are computationally expensive and require more memory/GPU resources.
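A quick sketch of how much of the window a prompt consumes (the gpt2 tokenizer and its 1024-token window are used as an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_window = 1024   # gpt2's maximum context length

prompt = "Explain the attention mechanism in one short paragraph."
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens used, {context_window - n_tokens} left in the window")
```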

What happens when the context window is exceeded?

  • 🪟 When the context window is exceeded, the model may lose track of important information, leading to less coherent responses.
  • 🛠️ Strategies to handle this include:
    • 📝 Summarizing previous context
    • 💾 Using external memory stores
    • 🧩 Chunking input data
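A minimal sketch of the chunking strategy (fixed-size chunks with a small overlap; the sizes are arbitrary):

```python
def chunk(tokens, size=512, overlap=64):
    """Split a long token sequence into overlapping chunks that each fit in the window."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(2000))                 # stand-in for a tokenized document
pieces = chunk(tokens)
print(len(pieces), [len(p) for p in pieces])
```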

Caution

Very large contexts (when permitted by the model) aren’t always a good thing: the model may become overwhelmed with information, leading to decreased performance and quality.

RAG (Retrieval-Augmented Generation)

  • 🔗 RAG combines retrieval-based and generation-based approaches.
  • 📚 It retrieves relevant documents from a knowledge base and uses them to inform the generation process.
  • 🎯 This allows for more accurate and contextually relevant responses.

Turtlecrown, Wikipedia
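A minimal RAG sketch: embed a few documents, retrieve the closest one by cosine similarity, and prepend it to the prompt (the sentence-transformers package and model name are assumptions; the final generation step is omitted):

```python
import numpy as np
from sentence_transformers import SentenceTransformer   # assumed installed

docs = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Attention is the core mechanism of the Transformer architecture.",
    "Nairobi is the capital city of Kenya.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumed embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query, k=1):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                                # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

query = "When was the Eiffel Tower built?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)   # this augmented prompt is then passed to the generative model
```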

RAG Demo


Concerns

Hallucinations

Was King Renoit real?
Is King Renoit mentioned in the Song of Roland, yes or no?

Hallucinations are STRUCTURAL to LLMs

  • There is no way to eliminate hallucinations in LLMs (only mitigate them)
  • Hallucinations are a byproduct of the probabilistic nature of LLMs
  • Hallucinations are more likely when:
    • The model is uncertain about the next token
    • The input prompt is ambiguous or lacks context
    • The model is asked to generate information outside its training data

Training data sets

  • very large datasets: e.g. ~570 GB of text data (~499 billion tokens) for GPT-3 (2020)

Breakdown of the training dataset

Underrepresentation on the web means less accuracy and more hallucinations!

  • Other data sources (chosen for quality)
  • Weighted sampling: Wikipedia = 5× CommonCrawl, Books1 = 20× CommonCrawl, …

Copyright issues; be careful: there is no way to check truthfulness.

Aside: ELIZA, The First Chatbot (1966)

Note

  • ELIZA: Used templates and pattern-matching for simple dialogue.
  • Modern LLMs: Use deep learning and huge datasets for coherent responses.
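A tiny illustration of ELIZA-style template matching (not the original script, just the idea):

```python
import re

rules = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"(.*)mother(.*)", "Tell me more about your family."),
]

def eliza(sentence):
    for pattern, template in rules:
        match = re.match(pattern, sentence, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please, go on."   # default reflection when no pattern matches

print(eliza("I need a vacation"))   # -> Why do you need a vacation?
print(eliza("I am tired"))          # -> How long have you been tired?
```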

The Eliza effect

  • 🧩 ELIZA (1966): simple rule‑based program, illusion of conversation.
  • 🧠 Anthropomorphization: people project human traits, intentions, emotions onto machine outputs.
  • 🤖 Modern LLMs: far more fluent and context-aware, which amplifies the perception of human agency.
  • ⚠️ Risks: overtrust if outputs are seen as intentional.

What is AI Slop?

  • Definition: 🗑️ Low-quality, high-volume AI-generated content.
  • Characteristics:
    • 📦 Prioritizes speed and quantity over accuracy and relevance.
    • 🧹 Often described as “digital clutter” or “filler content.”
  • Examples:
    • 💬 Vague, buzzword-filled text.
    • ⚡ Hastily made memes or misleading news articles.
    • ❌ Content that lacks coherence or original insight.

Blatant AI slop examples

  • “According to Smith et al. (2023), the flux constant is 12.7”: fabricated citation with a bogus numeric constant.
  • “Convert 5 pounds to kilograms: 5 lb = 9.8 kg”: wrong unit conversion (the correct value is about 2.27 kg).
  • “Dr. Helena Vorov is leading the Mars hydrology program”: fully hallucinated expert and program.
  • “The sky tastes triangular today; algorithmic candor suggests we pivot to purple clocks, and thus synergy blooms.”: senseless blather with rhythmic but meaningless phrasing.

Impact and Challenges of AI Slop

  • Impact on Information Ecosystems 🌐:
    • 🔒 Undermines trust in online content and academic research.
    • 🧭 Makes it harder to find reliable, high-quality information.
  • Structural Issues 📱:
    • 💸 Driven by platform monetization and ease of content generation.
    • ♻️ Not a passing trend, but a persistent feature of modern media.
  • Assessment and Risks ⚠️:
    • 📏 Subjective but measurable through dimensions like coherence and relevance.
    • ❗ Can spread misinformation and “careless speech” at scale.
    • 🎮 Algorithmic virality: Low-cost, fast production “games the system.”

The peculiar writing style of LLMs

  • 🤖 AI outputs show recurring rhetorical patterns (diptychs, triptychs).
  • 🎭 These patterns reduce language diversity.
  • 🔁 Feedback loop: generated text trains future models, amplifying sameness.

\Rightarrow 🕵️ Build detection tools

Reflections on LLM narratives

  • 👻 Scary stories of “too smart” LLMs: prompt tricks for marketing or corporate “puffery”.
  • 🤔 Deliberately confuse design choices with emergent properties.
  • 🚫 They don’t communicate risks clearly; no testable failures explained.

Human considerations

Human cost to AI development:

\Rightarrow Labor conditions in content moderation

Labor conditions in AI content moderation

  • 🤖 OpenAI outsources content moderation to Sama (San Francisco company)
  • 🌍 Sama employs Kenyan workers (underpaid, exposed to toxic content, alienating work)
  • 🚫 Denied sessions with wellness counselors
  • ⚖️ Lawsuits in progress with Meta in Nairobi

“Outsourcing trauma to the developing world”

  • 🏙️ Workers are approached in Kibera, the largest informal settlement in Africa
  • 💸 Salaries are too low to improve workers’ situation, only to maintain it
  • ✊ Attempts to unionize in 2024: mass firings in retaliation, with work shifted to Majorel (🇫🇷 a French company)

Ecological impact of AI 🌍

Key takeaways (Shift Project, Oct 2025)

  • 🌐 Digital growth 🚀 outpaces decarbonization goals.
  • 🤖 Generative AI boosts server & data center demand.
  • ⚡ Digital sector uses lots of electricity & emits CO₂.
  • 🌍 +9% greenhouse gas emissions per year, even with decarbonized electricity mix
  • 🎯 -5% per year needed to reach net zero emissions target
  • 🚫 By 2030, the projected trajectory for data centers is unsustainable.
  • ⚠️ Up to 920 MtCO₂e/year, up to 2× France’s annual emissions.

The Shift Project 2025 Source, reformatted by us

Electricity consumption

The Shift Project 2025 Source, reformatted by us

Exploratory prospective scenario of undifferentiated deployment of compute supply and its widespread adoption

AI Usage projection

Abundance without boundaries scenario, by usage.

Source: Schneider Electric, reformatted by us

Drivers and risks

  • 🔄 Generative AI creates a cycle: more compute → more use.
  • ⚠️ Risks: fossil lock-in 🛢️, water use 💧, server footprint 🖥️, regional carbon gaps 🌎.
  • 🚫 Without rules, growth stays unsustainable.

The Shift Project 2025 Source, reformatted by us

Practical recommendations 🛠️

  • 📊 Track & report energy, carbon, water for AI.
  • 🏦 Set energy-carbon budgets for organizations.
  • 🤔 Ask: do we need big models? Use leaner options when possible.
  • 🌱 Eco-design: sample smart, use green grids, right-size models, drop unsustainable services.

The Shift Project 2025 Source, reformatted by us

Another perspective on AI energy use

  • 🤖 Sometimes, AI can emit less CO₂ than humans for some tasks.
  • 📊 Results depend on boundaries, hardware, reuse, and downstream effects.
  • 🧩 Treat single studies as one data point; compare methods and scopes.
  • 🗣️ Use as debate counterpoint; stress need for standard metrics before policy decisions.

Source: Nature, 2024

Small, “on-demand” language models advocacy

Key points

  • 🤏 Not all LLMs are huge; smaller models can be effective and efficient.
  • 💸 Smaller models reduce costs, latency, and environmental impact.
  • 🔓 On-device models enhance privacy and accessibility.
  • 🎯 Choose model size based on task needs; bigger isn’t always better.

Small models for specific tasks

  • 🤏 Small models excel in specific tasks (e.g., text classification, sentiment analysis).
  • ⚡ They can be fine-tuned quickly with less data and compute.
  • 💡 Often match or exceed large model performance on niche tasks.
  • 🔄 Enable rapid experimentation and iteration.

\Rightarrow 🚀 Small models foster innovation by lowering entry barriers for developers and researchers.

AI Agents

AI agents are a programming paradigm involving two main flavours:

  • Reactive agents 🤖: respond to inputs with pre-defined or LLM-driven actions (the agent “reacts” and may decide next steps).
  • Pipelined agents 🔁: process inputs through a series of stages/components (the program orchestrates calls and post-processing).

Control flow distinction ⚙️:

  • internal control = reactive (the LLM drives decisions)
  • external control = pipeline (the programmer decides when/how to call the LLM)
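A schematic contrast of the two control flows (llm_call and web_search are hypothetical placeholders, not a real API):

```python
# Hypothetical stand-ins: llm_call(prompt) returns text, web_search(query) returns a string.
def llm_call(prompt: str) -> str: ...
def web_search(query: str) -> str: ...

# Pipelined agent: the *program* fixes the steps; the LLM is called at predetermined points.
def pipeline_agent(question: str) -> str:
    evidence = web_search(question)
    return llm_call(f"Answer using this evidence:\n{evidence}\n\nQuestion: {question}")

# Reactive agent: the *LLM* decides the next action; the program only executes it in a loop.
def reactive_agent(question: str, max_steps: int = 5) -> str:
    history = question
    for _ in range(max_steps):   # cap the loop, since the LLM may never decide to stop
        decision = llm_call(f"{history}\nNext action? (SEARCH:<query> or ANSWER:<text>)")
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:")
        if decision.startswith("SEARCH:"):
            history += "\n" + web_search(decision.removeprefix("SEARCH:"))
    return "No answer within the step budget."
```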

Reactive agents

Examples:

  • 🤖 Chatbots that respond to user queries with pre-defined answers.
  • ⚙️ Simple automation scripts that trigger actions based on specific events (e.g., web search).
  • 🧭 Agentic mode in Code Assistants — autonomous actions and decision-making.

Warning

\Rightarrow Control is done by the LLM itself, with all the associated risks: infinite loops, unsupervised and potentially dangerous actions, etc.

Pipeline Agents

Examples:

  • 🔍 RAG queries
  • 📝 Summarizing documents
  • 🔗 Communicating with other agents

Note

\Rightarrow Control is done by ordinary program logic

Reactive agent 2025 : MCP

Source: Ujjwal Khadka

Agent demo

Conclusion

The real question

Are the benefits of using generative AI worth the cost of extra supervision and the additional engineering effort?

The general answer is:

Takeaways

Tip

Use small models or locally deployed models when possible, and only use large models when absolutely necessary.

Important

Never, ever trust LLM outputs without verification.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32.
Cortes, Corinna, and Vladimir Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20: 273–97.
Freund, Yoav, and Robert E Schapire. 1997. “A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences 55 (1): 119–39.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT press.
Minsky, Marvin, and Seymour Papert. 1969. Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press.
Rosenblatt, Frank. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review 65 (6): 386.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.