Everything you wanted to know about Deep Learning but were afraid to ask
2024-02-01
Artificial Intelligence (AI)
Machine Learning (ML)
Deep Learning (DL)
Large Language Models (LLM)
Varying degrees of theoretical guarantees
Myriad of ad-hoc choices, engineering tricks and empirical observations
Current choices are critical for success: what are their pros and cons?
Try \rightarrow Fail \rightarrow Try again is the current pipeline
Criticizing an entire community (and an incredibly successful one at that) for practicing “alchemy”, simply because our current theoretical tools haven’t caught up with our practice is dangerous. Why dangerous? It’s exactly this kind of attitude that lead the ML community to abandon neural nets for over 10 years, despite ample empirical evidence that they worked very well in many situations. (Yann LeCun, 2017, My take on Ali Rahimi’s “Test of Time” award talk at NIPS.)
Also, on the hardware side:
shape=(batch, height, width, features)
\Rightarrow input can be anything: images, videos, text, sound, …
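For instance (a minimal NumPy sketch; the batch of RGB images is a hypothetical example), a whole batch is stored as one 4-D tensor with this layout:

```python
# A hypothetical batch of 32 RGB images of size 224x224, stored with the
# layout shape = (batch, height, width, features).
import numpy as np

batch = np.zeros((32, 224, 224, 3), dtype=np.float32)
print(batch.shape)  # (32, 224, 224, 3)
```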
f(x)=\frac{x_{1}x_{2}\sin(x_3) + e^{x_{1}x_{2}}}{x_3}, \quad \text{compute } \nabla f(x)
\begin{darray}{rcl} x_4 & = & x_{1}x_{2}, \\ x_5 & = & \sin(x_3), \\ x_6 & = & e^{x_4}, \\ x_7 & = & x_{4}x_{5}, \\ x_8 & = & x_{6}+x_7, \\ x_9 & = & x_{8}/x_3. \end{darray}
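Automatic differentiation traverses exactly this decomposition. A minimal sketch, assuming PyTorch's autograd (the evaluation point (1, 2, 3) is arbitrary):

```python
# Reverse-mode automatic differentiation of
# f(x1, x2, x3) = (x1*x2*sin(x3) + exp(x1*x2)) / x3, assuming PyTorch.
import torch

x1, x2, x3 = (torch.tensor(v, requires_grad=True) for v in (1.0, 2.0, 3.0))

x4 = x1 * x2           # intermediate variables of the computational graph
x5 = torch.sin(x3)
x6 = torch.exp(x4)
x7 = x4 * x5
x8 = x6 + x7
x9 = x8 / x3           # x9 = f(x1, x2, x3)

x9.backward()          # reverse sweep through the graph
print(x1.grad, x2.grad, x3.grad)   # components of the gradient at (1, 2, 3)
```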
Example with a non-convex function
f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2
f = ([x1, x2]) => (x1**2 + x2 - 11)**2 + (x1 + x2**2 - 7)**2;
{
const linspace = d3.scaleLinear().domain([0, 49]).range([minX, maxX]);
const X1 = Array.from({length: 50}, (_, i) => linspace(i));
const X2 = Array.from({length: 50}, (_, i) => linspace(i));
// Himmelblau's function f(x1, x2)
const f = ([x1, x2]) => (x1**2 + x2 - 11)**2 + (x1 + x2**2 - 7)**2;
// Plotly's surface expects z[j][i] to pair with y[j] (rows) and x[i] (columns)
const Z = X2.map(x2 => X1.map(x1 => f([x1, x2])));
const data = [{
x: X1.flat(),
y: X2.flat(),
z: Z,
type: 'surface'
}];
const layout = {
title: '',
autosize: false,
width: 500,
height: 500,
paper_bgcolor: "rgba(0,0,0,0)",
plot_bgcolor: "rgba(0,0,0,0)",
template: 'plotly_dark',
margin: {
l: 65,
r: 50,
b: 65,
t: 90,
}
};
const div = document.createElement('div');
Plotly.newPlot(div, data, layout,{displayModeBar: false});
return div;
}
// Plain gradient descent on the Himmelblau function, starting from (x1, x2)
function grad_descent(x1,x2,step,max_iter) {
// Analytic gradient of f(x1, x2) = (x1^2 + x2 - 11)^2 + (x1 + x2^2 - 7)^2
function f_grad(x1, x2) {
let df_x1 = 2 * (-7 + x1 + x2**2 + 2 * x1 * (-11 + x1**2 + x2));
let df_x2 = 2 * (-11 + x1**2 + x2 + 2 * x2 * (-7 + x1 + x2**2));
return [df_x1, df_x2];
}
let grad = f_grad(x1, x2);
let iterations = [[x1, x2]];
let count = 0;
while (count < max_iter) {
// One descent step: move against the gradient
x1 -= step * grad[0];
x2 -= step * grad[1];
grad = f_grad(x1, x2);
// Keep the new iterate only if it stays finite and inside the plotting window
if (isFinite(x1) && isFinite(x2) &&
(minX < x1) && (x1 < maxX) &&
(minX < x2) && (x2 < maxX))
iterations.push([x1, x2]);
else iterations.push(iterations[count]);
count += 1;
}
return iterations;
}
viewof descent_params = Inputs.form({
x1: Inputs.range([minX, maxX], {step: 0.1, value: 0, label: 'x1'}),
x2: Inputs.range([minX, maxX], {step: 0.1, value: 0, label: 'x2'}),
step: Inputs.range([0.001, 0.04], {step: 0.001, value: 0.01, label: 'step_size'})
})
{
var iterations = grad_descent(descent_params.x1,descent_params.x2,descent_params.step,20)
return Plot.plot({
aspectRatio: 1,
x: {tickSpacing: 50, label: "x1 →"},
y: {tickSpacing: 50, label: "x2 →"},
color: {scheme: "RdBu"},
width: 400,
style: {
backgroundColor: 'rgba(0,0,0,0)'
},
marks: [
Plot.contour({
fill: (x1, x2) => Math.sqrt((x1**2 + x2 - 11)**2 + (x1 + x2**2 - 7)**2),
x1: minX,
y1: minX,
x2: maxX,
y2: maxX,
// Observable Plot's contour mark uses "thresholds"; Plotly-style options
// (showlegend, colorscale, ncontours) do not apply here
thresholds: 30
}),
Plot.line(iterations,{marker: true})
]
})
}
Sensitivity to initial point and step size
\theta_{k+1} \leftarrow \theta_k - \frac{\eta}{|\text{batch}|}\sum_{i\in\text{batch}}\nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)
\Rightarrow No general guarantees of convergence in DL setting
SGD, Adam, RMSProp
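A minimal sketch of this update loop, assuming PyTorch (the model, loss and mini-batch below are hypothetical placeholders); switching between SGD, Adam and RMSProp is a one-line change:

```python
# One mini-batch update theta <- theta - eta * grad, assuming PyTorch.
import torch

model = torch.nn.Linear(10, 1)                    # hypothetical model f_theta
loss_fn = torch.nn.MSELoss()                      # hypothetical loss L
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # drop-in swap
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)  # drop-in swap

batch_x, batch_y = torch.randn(32, 10), torch.randn(32, 1)      # one mini-batch

optimizer.zero_grad()
loss = loss_fn(model(batch_x), batch_y)  # mean of L(f_theta(x_i), y_i) over the batch
loss.backward()                          # gradients w.r.t. theta
optimizer.step()                         # apply the update rule
```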
[Illustration with the words: women, woman, window, widow]
BERT tokenization (token → id) of "we need to book a flight, and we need a book to pass time, also book an hotel!":
[CLS]→101, we→2057, need→2342, to→2000, book→2338, a→1037, flight→3462, ,→1010, and→1998, we→2057, need→2342, a→1037, book→2338, to→2000, pass→3413, time→2051, ,→1010, also→2036, book→2338, an→2019, hotel→3309, !→999, [SEP]→102
First 5 vector values for each instance of "book".
book a flight: tensor([ 2.7359, -6.4879, 0.6554, 0.4170, 6.0187])
need a book: tensor([ 3.3611, 1.1988, 3.2118, -0.8919, 5.3709])
book an hotel: tensor([ 3.2382, -0.8284, 1.4804, -0.7448, 5.4106])
Vector similarity for *similar* meanings: 0.82
Vector similarity for *different* meanings: 0.59
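Such contextual vectors and their similarities can be reproduced roughly as follows (a sketch, assuming the Hugging Face transformers library and bert-base-uncased; exact values depend on the model and the layer used):

```python
# Sketch: one contextual vector per occurrence of "book" in the sentence,
# assuming the Hugging Face `transformers` library and `bert-base-uncased`.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = ("we need to book a flight, and we need a book to pass time, "
        "also book an hotel!")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)

book_id = tokenizer.convert_tokens_to_ids("book")
positions = (inputs["input_ids"][0] == book_id).nonzero().flatten()
vectors = hidden[positions]                               # one vector per "book"

cos = torch.nn.functional.cosine_similarity
print(cos(vectors[0], vectors[2], dim=0))   # similar meanings: flight / hotel
print(cos(vectors[0], vectors[1], dim=0))   # different meanings: flight / a book
```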
Previously, recurrent networks were limited by their sequential handling of dependencies,
\Rightarrow transformers capture dependencies over the “whole” sequence in parallel (much faster)
Vaswani et al. (2017)
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
Vaswani et al. (2017)
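Written out directly, the formula above is only a few lines (a minimal single-head NumPy sketch; the toy shapes are assumptions, and the real model uses multiple heads and learned projections):

```python
# Minimal sketch of scaled dot-product attention (single head, NumPy).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V       # attention-weighted sum of values

# Hypothetical toy shapes: 4 tokens, dimension 8
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
print(attention(Q, K, V).shape)               # (4, 8)
```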
Was King Renoit real?
Is King Renoit mentioned in the Song of Roland, yes or no?
{
var data = [{
values: [3, 8, 7, 22, 60],
labels: ["wikipedia", "Books1", "Books2", "Webtext2", "Common Crawl"],
textinfo: "label+percent",
type: "pie",
marker: {
colors: ["lightcyan", "cyan", "royalblue", "darkblue", "gold"]
}
}];
var layout = {
template: 'plotly_dark',
paper_bgcolor: "rgba(0,0,0,0)",
plot_bgcolor: "rgba(0,0,0,0)",
font: {
size: 26,
color: "white"
},
margin: {"t": 0, "b": 0, "l": 0, "r": 0},
showlegend: false
};
const div = document.createElement('div');
Plotly.newPlot(div, data, layout,{displayModeBar: false});
return div;
}
Breakdown of the training dataset
Underrepresentation on the web means less accuracy and more hallucinations!
Copyright issues; and be careful: there is no way to check truthfulness
Impactful tool, with limitations and ethical challenges
lack of theoretical understanding, trial and error only
\Rightarrow engineering ad-hoc solutions, giant panels of knobs to turn
race for performance, impacting the quality and content of reviews
\Rightarrow replacing domain specialists and researchers
\Rightarrow in short, as long as the scaling is sufficient, the model size is largely decoupled from the dataset size and the risk of overfitting is mitigated (a major advantage of DL over classical ML on such datasets).
Very successful on tabular (structured) data, but also on some standardized data (like MNIST). Used in almost every DL model as the last layers before the output.
Immensely successful in computer vision.
For sequences (first DL models for NLP and speech recognition).
Encode the graph structure (nodes, edges, global attributes) into embedding vectors
Use those vectors as input to a network.
Attention mechanism \Rightarrow breakthrough of protein folding prediction with AlphaFold of DeepMind.
Jumper et al. (2021)
[generative adversarial network for celebrity faces](https://towardsdatascience.com/generative-adversarial-network-gan-for-dummies-a-step-by-step-tutorial-fdefff170391)
Tokenize the word: tokenizer
['token', '##izer']
pipeline to train a representation model (like BERT):
tokenize text \rightarrow map token to a unique id \rightarrow map id to randomized initial vector \rightarrow train
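The first two steps of this pipeline, sketched with a WordPiece tokenizer (the Hugging Face bert-base-uncased tokenizer is an assumption; any subword tokenizer behaves similarly):

```python
# Sketch of steps 1-2: tokenize, then map each token to a unique id,
# assuming the Hugging Face `bert-base-uncased` WordPiece tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("tokenizer")        # e.g. ['token', '##izer']
ids = tokenizer.convert_tokens_to_ids(tokens)   # unique integer ids
print(tokens, ids)
# The embedding layer then maps each id to an initially random vector,
# which is learned during training.
```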
BERT training set = books + Wikipedia: word completion + next sentence prediction
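The word-completion (masked language modeling) objective can be probed directly (a sketch, assuming the Hugging Face fill-mask pipeline; the sentence is an arbitrary example):

```python
# Sketch: probing BERT's word completion objective with a masked token,
# assuming the Hugging Face `fill-mask` pipeline and `bert-base-uncased`.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("We need to book a [MASK] to Paris.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```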