Calculus is the study of rates of change, and without it, modern machine learning would not be possible. In many ways, machine learning is about optimizing quantities – making the best possible predictions, or making prediction errors as small as possible – and calculus is the tool that enables us to perform this optimization.
We’ll address these lofty goals throughout the semester. Here, we will review key ideas from a first course in calculus (e.g. Math 115).
Suppose f is a function that takes in a single real number and outputs a single real number, i.e. f:R→R.
If f is a function, then the derivative of f is another function, sometimes denoted f′, such that f′(x) is the “instantaneous” rate of change of f at the input x.
To understand what I mean by “instantaneous” rate of change, let’s consider an example. Suppose $f(x) = \frac{1}{4}x^2 - 3$. The graph of f is shown below, in blue, along with a slider for input values of x. Drag the slider.
At any point x, the tangent line to the graph of f is the best linear approximation of the graph of f near x.
For instance, consider when x=−4. At x=−4, $f(-4) = \frac{1}{4}(-4)^2 - 3 = 1$.
from utils import plot_functions
import numpy as np
# Define the function and its derivative
f = lambda x: (1/4)*x**2 - 3
f_prime = lambda x: (1/2)*x
# Create the plot with both the function and the tangent line at x = -4
x_range = (-6, 6)
y_range = (-6, 6)
# Point of tangency and slope of the tangent line
x_point = -4
y_point = f(x_point)
slope = f_prime(x_point)
# Plot the function, plus the tangent line drawn only on a short interval around x = -4
fig = plot_functions(
    [f, lambda x: np.where((-5.34 < x) & (x < -2.65), slope * (x - x_point) + y_point, np.nan)],
['f(x)', 'Tangent line'],
[x_point],
x_range=x_range,
y_range=y_range
)
fig.update_layout(width=600, height=450, title="Tangent line at x = -4")
fig.show(renderer='notebook')
The tangent line at x=−4 is the line that passes through the point (−4,1) that best approximates f near x=−4, among all other lines that pass through (−4,1). In the plot above, use your mouse to zoom in on the point (−4,1), and you’ll see that when zoomed in, the tangent line and original function are very difficult to distinguish.
This intuitive definition of the derivative – as being the slope of the tangent line – is the most important way to think about the derivative. The formal definition is also important, and we’ll get there next, but you should have enough context for this first activity.
Activity 1
Let $f(x) = \frac{1}{4}x^4 + \frac{1}{3}x^3 - x^2 + 2$.
Given that $f'(x) = x^3 + x^2 - 2x$, find the equation of the tangent line to f(x) at the following points:
x=−3
x=−1
x=1
As a bonus exercise, try to verify that the provided formula for f′(x) is correct, using your knowledge of derivatives from Calculus 1.
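If you attempt the bonus exercise, one way to sanity-check a derivative formula (without giving the symbolic answer away) is to compare it against a centered difference quotient. A small sketch, not part of the original activity:

```python
# Spot-check the provided derivative formula numerically (not a proof):
# compare (f(x+h) - f(x-h)) / (2h) against x^3 + x^2 - 2x at several points.
f = lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2
claimed = lambda x: x**3 + x**2 - 2*x
h = 1e-6
for x in [-3, -1, 1, 2]:
    approx = (f(x + h) - f(x - h)) / (2 * h)
    print(x, round(approx, 4), claimed(x))
```

The two columns should agree to several decimal places at every point tested.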
Let’s review the more formal definition of the derivative. First, remember the general formula for the slope of the line between two points $(x_1, y_1)$ and $(x_2, y_2)$:

$$\text{slope} = \frac{y_2 - y_1}{x_2 - x_1}$$
Let’s say we’re trying to find the slope of the tangent line at x=a, which is the line that passes through the point (a,f(a)) whose slope is the instantaneous rate of change of f at x=a. To find that instantaneous rate of change, we can find the slope of the line between (a,f(a)) and some other point (b,f(b)), where b−a is as close to 0 as possible.
In the example below, as b approaches a=1, the slope of the line between (1,f(1)) and (b,f(b)) approaches the slope of the tangent line at x=1. Note that the formal name for the line between any two points on a function is a secant line.
from utils import plot_multiple_secant_lines
f = lambda x: (1/4)*x**2 - 3
x_range = (-6, 6)
y_range = (-6, 6)
# Define the three secant line pairs
secant_points = [
(1, 5), # Secant between 1 and 5
(1, 2), # Secant between 1 and 2
    (1, 1.1)  # Secant between 1 and 1.1
]
fig = plot_multiple_secant_lines(f, secant_points, x_range, y_range)
fig.update_layout(width=700, height=300)
fig.show(scale=3, renderer='png')
Let’s be more precise, using the idea of a limit from Calculus 1. Recall, $\lim_{x \to a} g(x) = L$ is pronounced “the limit of g(x) as x approaches a is L”. If $\lim_{x \to a} g(x) = L$, then as x gets closer and closer to a, g(x) gets closer and closer to L. (Intuitively, you might think this must always mean that g(a) = L, but that’s not always the case.)
The slope of the tangent line at x=a is the limit of the slope of the line between (a,f(a)) and (b,f(b)) as b approaches a:
$$\text{slope of tangent line at } a = \lim_{b \to a} \frac{f(b) - f(a)}{b - a}$$
This definition alone can help us compute some common derivatives. For instance, if $f(x) = x^2$, then the slope of the tangent line at x=a is:

$$\lim_{b \to a} \frac{f(b) - f(a)}{b - a} = \lim_{b \to a} \frac{b^2 - a^2}{b - a} = \lim_{b \to a} \frac{(b-a)(b+a)}{b-a} = \lim_{b \to a} (b+a) = 2a$$

Here, I used the difference of squares formula, $b^2 - a^2 = (b-a)(b+a)$. To find the slope of the tangent line using the limit definition, you’ll need to use some sort of algebraic manipulation to simplify the expression, because the limit of the denominator is 0, and we can’t divide by 0.
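The limit is easy to watch happen numerically. Here is a small sketch (not from the notes) for f(x) = x² at a = 3, where the secant slopes should approach 2a = 6:

```python
# For f(x) = x^2 and a = 3, the secant slope (f(b) - f(a)) / (b - a)
# approaches 2a = 6 as b approaches a.
f = lambda x: x**2
a = 3
for b in [4, 3.1, 3.01, 3.001]:
    print(b, (f(b) - f(a)) / (b - a))
# the slopes shrink toward 6 as b gets closer to 3
```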
Instead of thinking of the slope of the tangent line as the limit of the slope of the line between the points (a, f(a)) and (b, f(b)) as b→a, a more general (and equivalent!) definition of the derivative is the limit of the slope of the line between the points (x, f(x)) and (x+h, f(x+h)) as h→0:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
I will not use this formal definition much in this class, but it’s good to understand where it comes from and why it works.
There are two equivalent notations for the derivative: $\frac{df}{dx}(x)$ and $f'(x)$. I used the notation f′(x) in the previous section since it’s easier to write and more commonly used in calculus courses. However, I’ll use the notation $\frac{df}{dx}(x)$ from now on, as it’ll make the transition to multivariable calculus more natural when we get there.
Often, for brevity, I will drop the (x) and just write $\frac{df}{dx}$. As an example, suppose $g(x) = \sin^2(x) + 3\log(x)$, where log(⋅) is the natural logarithm (with base e). Then, the derivative of g is:

$$\frac{dg}{dx} = 2\sin(x)\cos(x) + \frac{3}{x}$$

$\frac{dg}{dx}$ (equivalently, $\frac{dg}{dx}(x)$) is a function, not a number. To get a number as an output, we need to plug in a value for x. For example, $\frac{dg}{dx}(\pi)$ is the number ($\frac{3}{\pi}$) corresponding to the slope of the tangent line to g at x=π.
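As a quick sanity check of that value, we can compare a centered difference quotient of g at x = π against 3/π. A small sketch, not part of the original text:

```python
import math

# g(x) = sin^2(x) + 3 ln(x); the claimed derivative gives
# dg/dx(pi) = 2 sin(pi) cos(pi) + 3/pi = 3/pi.
g = lambda x: math.sin(x)**2 + 3 * math.log(x)
h = 1e-6
approx = (g(math.pi + h) - g(math.pi - h)) / (2 * h)
print(approx, 3 / math.pi)   # the two numbers should agree closely
```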
To actually find $\frac{dg}{dx}$, we used several derivative rules, which we’ll now review.
None of the five key rules directly specify how to take the derivative of an expression like $4x^5$, and even though you may look at this expression and see that its derivative is $20x^4$, it’s well worth our time to review the rules carefully.

Let’s split $4x^5$ into two separate functions, 4 and $x^5$, and then use the product rule.

$$\frac{d}{dx}\left(4x^5\right) = 4\left(\frac{d}{dx}x^5\right) + \left(\frac{d}{dx}4\right)x^5 = 4 \cdot 5x^4 + 0 = 20x^4$$

While applying the product rule, I also used the power rule to differentiate $x^5$, and the constant rule to differentiate 4.
With this in mind, we can return to the goal of differentiating $f(x) = \frac{1}{x^2-1}$.

Note that f(x) can also be written as $f(x) = (x^2-1)^{-1}$, which looks like something we can use the power rule on. To fully apply the power rule, we’ll need to use the chain rule, too.

$$\frac{d}{dx}\left(x^2-1\right)^{-1} = \underbrace{-1 \cdot (x^2-1)^{-2}}_{\text{power rule}} \cdot \underbrace{\frac{d}{dx}\left(x^2-1\right)}_{\text{required by the chain rule}} = -1 \cdot (x^2-1)^{-2} \cdot 2x = -\frac{2x}{(x^2-1)^2}$$
It looks like $f(x) = \frac{x^2+1}{x^2-1}$ is a quotient of two functions, $x^2+1$ and $x^2-1$, but I intentionally did not introduce a quotient rule – it’s unnecessary, and can be replicated using the product rule and chain rule, since f(x) can be written as a product:

$$f(x) = \frac{x^2+1}{x^2-1} = (x^2+1) \cdot \frac{1}{x^2-1}$$

You’ll notice that we’ve already found the derivative of $\frac{1}{x^2-1}$ in Example 2, so we can use that result here while applying the product rule.

$$\begin{aligned}
\frac{df}{dx} &= \frac{d}{dx}\left((x^2+1) \cdot \frac{1}{x^2-1}\right) \\
&= \underbrace{\left(\frac{d}{dx}(x^2+1)\right) \cdot \frac{1}{x^2-1} + (x^2+1) \cdot \left(\frac{d}{dx}\frac{1}{x^2-1}\right)}_{\text{product rule}} \\
&= 2x \cdot \frac{1}{x^2-1} + (x^2+1) \cdot \underbrace{\left(-\frac{2x}{(x^2-1)^2}\right)}_{\text{from Example 2}} \\
&= \frac{2x}{x^2-1} - \frac{2x(x^2+1)}{(x^2-1)^2} \\
&= \underbrace{\frac{2x(x^2-1) - 2x(x^2+1)}{(x^2-1)^2}}_{\text{common denominator}} \\
&= \frac{2x(x^2-1-x^2-1)}{(x^2-1)^2} \\
&= \frac{-4x}{(x^2-1)^2}
\end{aligned}$$
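The final result can be spot-checked numerically by comparing it against difference quotients of the original quotient. A sketch:

```python
# Spot-check: d/dx[(x^2+1)/(x^2-1)] should equal -4x/(x^2-1)^2.
f = lambda x: (x**2 + 1) / (x**2 - 1)
claimed = lambda x: -4 * x / (x**2 - 1)**2
h = 1e-6
for x in [2.0, 3.0, -1.5]:
    approx = (f(x + h) - f(x - h)) / (2 * h)
    print(x, round(approx, 4), round(claimed(x), 4))
```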
The chain rule is extremely pervasive in machine learning, which is why I’ve included examples like the one above. This site contains dozens more examples of the chain rule in practice.
Activity 2
Note that the activities in this section are quite challenging, so make sure you’ve attempted and fully understood the examples above first.
Activity 2.1
An important function in machine learning is the sigmoid function, which is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
σ(x) has a nice S-shape, and is used for predicting probabilities.
Find the derivative of σ(x), and show that it satisfies the following property:
$$\frac{d\sigma}{dx}(x) = \sigma(x)\left(1 - \sigma(x)\right)$$
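A numerical check of this identity (this is only a sanity check, not the derivation the activity asks for):

```python
import math

# Compare a centered difference quotient of sigma against sigma(x)(1 - sigma(x)).
sigma = lambda x: 1 / (1 + math.exp(-x))
h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    approx = (sigma(x + h) - sigma(x - h)) / (2 * h)
    print(round(approx, 6), round(sigma(x) * (1 - sigma(x)), 6))
```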
Activity 2.2
Find the derivative of each of the following functions:
f(x)=sin(4πx)
$g(x) = (2x+1)^{3x}$ (Hint: Start by taking the natural logarithm of both sides, then take the derivative of both sides.)
Activity 2.3
Suppose x and y satisfy the following relationship:
$$x^2 = y^3 - 11$$
Find an expression for $\frac{dy}{dx}$ that involves both x and y. To do this, don’t solve for y in terms of x – instead, take the derivative of both sides of the equation with respect to x, use the power and chain rules, and re-arrange to isolate $\frac{dy}{dx}$.
Find the slope of the tangent line to the curve at the point (x,y)=(4,3).
As stated at the start of this section, calculus is a tool for optimization – that is, finding the inputs that maximize or minimize a function. Let’s be more precise about what we mean by “maximize” and “minimize”.
Consider $f(x) = \frac{1}{4}x^4 + \frac{1}{3}x^3 - x^2 + 2$, shown below.
from utils import plot_functions
import plotly
plotly.io.renderers.default = 'png'
f_list = [lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2]
f_titles = ['f(x)=x^4/4+x^3/3-x^2+2']
crit_x = [-2, 0, 1]
crit_labels = ['global minimum', 'local maximum', 'local minimum']
x_range = (-5, 5)
y_range = (-2, 6)
fig = plot_functions(f_list, f_titles, crit_x, x_range, y_range)
fig.update_layout(width=600, height=400)
# Annotate f(x) for x in crit_x with the label (x, f(x))
f = lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2
for i, x in enumerate(crit_x):
y = f(x)
fig.add_annotation(
x=x,
y=y + 0.1 * (1 if i % 2 == 0 else -1),
text=crit_labels[i],
showarrow=True,
arrowhead=1,
ax=0,
ay=[-50, 40, -50][i],
bgcolor="white",
bordercolor="black",
borderwidth=1
)
fig.show(scale=4)
Where is f(x) maximized and minimized?
At x=−2, f(x) is less than it is at all other inputs. This means (−2, f(−2)) is a global minimum.
At x=0, f(x) looks like it is greater than it is at all other inputs, but only if you restrict your attention to points near x=0. This means (0, f(0)) is a local maximum. (0, f(0)) is not a global maximum because there are plenty of points where f(x) > f(0) – they are just not immediately adjacent to x=0.
Similarly, (1,f(1)) is a local minimum.
f(x) does not have a global maximum, since f(x) approaches infinity as x increases beyond x=1 or decreases beyond x=−2. If we were to restrict the domain (or set of possible inputs) of f(x) to, say, [−3,3], then there would be a global maximum at x=3.
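A brute-force way to confirm the restricted-domain claim, as a sketch (not part of the original notes):

```python
import numpy as np

# On the restricted domain [-3, 3], where is f largest? Evaluate f on a
# dense grid and take the argmax.
f = lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2
xs = np.linspace(-3, 3, 60001)
ys = f(xs)
print(xs[np.argmax(ys)])   # the maximum occurs at the endpoint x = 3
```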
Note that we usually care more about the inputs that maximize or minimize a function, rather than the actual values of the function at those inputs. In the above example, the fact that x=−2 is a global minimum is important; the fact that $f(-2) = -\frac{2}{3}$ is not as important.
To recap:

A global maximum/minimum is a point where the function’s value is larger/smaller than at all other inputs in its domain.

A local maximum/minimum is a point where the function’s value is larger/smaller than at all nearby inputs.

A function may have several local extrema, and may have no global maximum or minimum at all.
Note that maxima is the plural of maximum, minima is the plural of minimum, and extrema refers to both maxima and minima.
Below, you’ll find several other examples of functions with varying numbers and types of extrema. Pay close attention to the relationship between the two functions in the second row, h(x) and k(x). k(x) results from scaling h(x) vertically (by a factor of 1/4) and shifting it up, and its extrema occur at the same inputs.
from utils import plot_functions_grid
import numpy as np
import plotly.graph_objects as go
f_0 = lambda x: x**3 + 1
f_1 = lambda x: -x**2
f_2 = lambda x: x**3 - 3*x
f_3 = lambda x: (1 / 4) * f_2(x) + 3
f_4 = lambda x: x**4 - 4*x**2
f_5 = lambda x: np.exp(x)
f_list = [f_0, f_1, f_2, f_3, f_4, f_5]
f_titles = [r"$f(x)=x^3+1$",
r"$g(x)=-x^2$",
r"$h(x)=x^3-3x$",
r"$k(x)=\frac{1}{4}(x^3-3x)+3 = \frac{1}{4}{\color{#d81b60}h\color{#d81b60}(\color{#d81b60}x\color{#d81b60})} + 3$",
r"$l(x)=x^4-4x^2$",
r"$m(x)=e^x$"]
x_range = (-4, 4)
y_range = (-6, 6)
# Compute extrema for each function (within x_range)
extrema = [
# f_0: y = x^3 + 1, monotonic, no local/global extrema
[("no extrema!", 1, -1)],
# f_1: y = -x^2, global max at x=0, global min at x=-4 and x=4 (endpoints)
[("global maximum", 0, f_1(0)), ("global minimum", -4, f_1(-4)), ("global minimum", 4, f_1(4))],
# f_2: y = x^3 - 3x, local max at x=-1, local min at x=1
[("local maximum", -1, f_2(-1)), ("local minimum", 1, f_2(1))],
# f_3
[("local maximum", -1, f_3(-1)), ("local minimum", 1, f_3(1)), (r"$\text{extrema at same positions as in } h(x)=x^3-3x$", 1, -1)],
# f_4: y = x^4 - 4x^2, local max at x=0, local min at x=-2,2
[("local maximum", 0, f_4(0)), ("local minimum", -np.sqrt(2), f_4(-np.sqrt(2))), ("local minimum", np.sqrt(2), f_4(np.sqrt(2)))],
    # f_5: y = exp(x), always increasing: no extrema; approaches 0 as x decreases
[("no extrema!", 1, -1), ("approaches 0 but never reaches it", -2, 1)],
]
fig = plot_functions_grid(f_list, f_titles, x_range=x_range, y_range=y_range, rows=3, cols=2)
# Annotate extrema
for i, extrema_list in enumerate(extrema):
for label, x_val, y_val in extrema_list:
# Add annotation
fig.add_annotation(
x=x_val, y=y_val + 0.2 * (-1 if "min" in label else 1),
text=label,
showarrow=True if "mum" in label else False,
arrowhead=1,
ax=0,
ay=20 if "min" in label else -20,
bgcolor="white",
bordercolor="black",
borderwidth=1,
font=dict(size=12),
xref=f"x{i+1}",
yref=f"y{i+1}",
)
# Add marker for actual extrema points
if "no extrema" not in label and "approaches" not in label:
fig.add_trace(
go.Scatter(
x=[x_val],
y=[y_val],
mode='markers',
marker=dict(
color=['#3d81f6', 'orange', '#d81b60', '#004d40', '#6f42c1', '#20c997'][i],
size=6,
symbol='circle'
),
showlegend=False
),
row=(i // 2) + 1,
col=(i % 2) + 1
)
fig.update_layout(showlegend=False, width=700, height=900).show(scale=3, renderer='png')
Activity 3
Let $q(x) = 8x^4 - 4x$. q(x) has a global minimum at $\left(\frac{1}{2}, -\frac{3}{2}\right)$.
For each of the following functions, find all extrema, and specify whether each extremum is a local maximum, global maximum, local minimum, or global minimum. Make sure to specify both the x-values and the y-values of each extremum.
$f(x) = 2q(x) + 10$

$g(x) = -10\,q(x)$

$h(x) = |q(x)|$

Finding the extrema of $l(x) = q(x)^2$ is a bit more complicated than in the examples above. Why?
You should notice that the derivative is 0 at all three extrema we identified earlier – the global minimum at x=−2, the local maximum at x=0, and the local minimum at x=1. Intuitively, the derivative is 0 at a maximum or minimum because the tangent lines at these points are horizontal (with slope 0), as the function is neither increasing nor decreasing at these points.
In the region between x=−2 and x=0, the derivative is positive, meaning the function is increasing.
Solving for the inputs that make the derivative 0 – i.e., finding the critical points – is a necessary, but not sufficient, step. If all we know is that the derivative is 0 at a point, we don’t know whether the point is a maximum or minimum. It may not be either, such as in the case of f(x)=x3, which has a critical point at x=0 that is neither a maximum nor a minimum.
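For polynomials like our running example, solving for critical points can be automated. Here is a small sketch using NumPy’s np.roots, which returns all roots of a polynomial given its coefficients:

```python
import numpy as np

# The critical points of f(x) = x^4/4 + x^3/3 - x^2 + 2 solve
# df/dx = x^3 + x^2 - 2x = 0; the coefficient list for np.roots is
# [1, 1, -2, 0] (for x^3, x^2, x, and the constant term).
print(np.sort(np.roots([1, 1, -2, 0])))   # approximately [-2, 0, 1]
```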
To be able to determine whether a critical point of f(x) is a maximum or minimum, we need to look at the second derivative of f(x). If the (first) derivative of f(x) is a function that describes the rate at which f(x) is changing, the second derivative – denoted $\frac{d^2f}{dx^2}$ – is a function that describes the rate at which the derivative is changing.
Physics provides us with an analogy that helps us understand the role of the second derivative. Suppose you’re driving down a straight road, and s(t) is your position on the road at time t, relative to your starting point (so a negative value of s(t) means you’ve moved backwards).
Then, $v(t) = \frac{ds}{dt}$ is your velocity (the rate at which your position is changing) and $a(t) = \frac{d^2s}{dt^2}$ is your acceleration (the rate at which your velocity is changing).

If $\frac{ds}{dt} > 0$ and $\frac{d^2s}{dt^2} = 0$, you are moving forward at a constant speed (say, on cruise control).

If $\frac{ds}{dt} > 0$ and $\frac{d^2s}{dt^2} > 0$, you are moving forward and your speed is increasing (you are accelerating).

If $\frac{ds}{dt} > 0$ and $\frac{d^2s}{dt^2} < 0$, you are moving forward, but your speed is decreasing, and eventually, your car will come to a halt.

Cases where $\frac{ds}{dt} < 0$ correspond to driving backwards!
Activity 4
Give real-world examples (similar to those provided above) for each of the following scenarios in the context of the physics analogy:
$\frac{ds}{dt} < 0$ and $\frac{d^2s}{dt^2} > 0$

$\frac{ds}{dt} < 0$ and $\frac{d^2s}{dt^2} < 0$

$\frac{ds}{dt} = 0$ and $\frac{d^2s}{dt^2} < 0$
Let’s put this in the context of our running example, $f(x) = \frac{1}{4}x^4 + \frac{1}{3}x^3 - x^2 + 2$. The second derivative of f(x) is:

$$\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{d}{dx}\left(\frac{1}{4}x^4 + \frac{1}{3}x^3 - x^2 + 2\right)\right) = \frac{d}{dx}\big(\underbrace{x^3 + x^2 - 2x}_{\text{first derivative of } f(x)}\big) = 3x^2 + 2x - 2$$
The second derivative, $\frac{d^2f}{dx^2}$, is a function, not a number. What does the second derivative look like, relative to the original function and first derivative?
from utils import plot_functions_grid
f_list = [
lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2,
lambda x: x**3 + x**2 - 2*x,
lambda x: 3*x**2 + 2*x - 2
]
f_titles = [
r'$\text{Original Function, } f(x)$',
r'$\text{First Derivative, } \frac{\text{d}f}{\text{d}x}$',
r'$\text{Second Derivative, } \frac{\text{d}^2f}{\text{d}x^2}$'
]
crit_x = [-2, 0, 1]
x_range = (-6, 6)
y_range = (-3, 7)
fig = plot_functions_grid(
f_list=f_list,
f_titles=f_titles,
rows=1,
cols=3,
x_range=x_range,
y_range=y_range,
title=None,
show_axis_labels=True,
xaxis_title='',
yaxis_title=''
)
# Mark the critical points of f on all three plots
import numpy as np
import plotly.graph_objects as go
for i in range(3):
y_crit = f_list[i](np.array(crit_x))
fig.add_trace(
go.Scatter(
x=crit_x, y=y_crit, mode='markers',
marker=dict(color=['#3d81f6', 'orange', '#d81b60', '#004d40', '#6f42c1', '#20c997'][i], size=8),
showlegend=False
),
row=1, col=i+1
)
# Add extra top margin so titles are not cut off
fig.update_layout(width=900, height=350, showlegend=False, margin=dict(t=50, l=20, r=20, b=0))
fig.show(scale=3, renderer='png')
f(x) is a polynomial of degree 4, $\frac{df}{dx}$ is a polynomial of degree 3, and $\frac{d^2f}{dx^2}$ is a polynomial of degree 2 – the degree drops by one each time, as a consequence of the power rule.
Recall that f(x) has critical points at x=−2, x=0, and x=1, which we’ve highlighted in all three plots above. Our goal is to determine an algebraic approach for determining whether these points are maxima, minima, or neither; we shouldn’t rely on the graph of f(x) alone, since we won’t always be able to see its graph.
At all three of these points, the first derivative is 0, meaning that the function is neither increasing nor decreasing at these points. But the second derivative $\frac{d^2f}{dx^2} = 3x^2 + 2x - 2$ gives us additional information:
At x=−2, $\frac{d^2f}{dx^2}(-2) = 6$, which is positive. So, at x=−2, f(x) is neither increasing nor decreasing, but its slope is increasing, since the second derivative is positive. So, as we move to the right of x=−2, the slope of the tangent line will increase, causing the function to increase. If the function increases to the right of x=−2 (and, by the same logic, decreases to its left), then x=−2 must correspond to a local minimum of f(x).

At x=0, $\frac{d^2f}{dx^2}(0) = -2$, which is negative. So, at x=0, f(x) is neither increasing nor decreasing, but its slope is decreasing. So, as we move to the right of x=0, the slope of the tangent line will decrease, causing the function to decrease. If the function decreases to the right of x=0, then x=0 must correspond to a local maximum of f(x).

At x=1, $\frac{d^2f}{dx^2}(1) = 3$ is positive, which, using the logic from the x=−2 case, means that x=1 also corresponds to a local minimum of f(x).
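The three cases above can be reproduced in a few lines of code. A sketch:

```python
# The second derivative test, in code: evaluate d^2f/dx^2 = 3x^2 + 2x - 2
# at each critical point and look at its sign.
second = lambda x: 3 * x**2 + 2 * x - 2
for c in [-2, 0, 1]:
    val = second(c)
    label = 'local minimum' if val > 0 else 'local maximum' if val < 0 else 'inconclusive'
    print(c, val, label)
# -2 6 local minimum / 0 -2 local maximum / 1 3 local minimum
```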
The sign of the second derivative is useful for more than just determining whether a critical point is a local maximum or minimum. Below, we’ve plotted f(x), along with annotations for the regions where the second derivative is positive and negative.
from utils import plot_functions
import numpy as np
import plotly.graph_objects as go
# Define the original function and its second derivative
f = lambda x: (1/4)*x**4 + (1/3)*x**3 - x**2 + 2
second_derivative = lambda x: 3*x**2 + 2*x - 2
x_range = (-6, 6)
y_range = (-3, 7)
# Find inflection points: roots of the second derivative
coeffs = [3, 2, -2]
inflection_points = np.roots(coeffs)
inflection_points = [float(x) for x in inflection_points if np.isreal(x) and x_range[0] <= x <= x_range[1]]
inflection_points.sort()
# Plot just f(x)
fig = plot_functions(
f_list=[f],
f_titles=[r'$\text{Original Function, } f(x)$'],
x_range=x_range,
y_range=y_range,
show_axis_labels=True,
xaxis_title='',
yaxis_title=''
)
# Shade regions: outside inflection points = pink, between = orange
x0, x1 = inflection_points
fig.add_vrect(
x0=x_range[0], x1=x0,
fillcolor="pink", opacity=0.15, line_width=0, layer="below"
)
fig.add_vrect(
x0=x0, x1=x1,
fillcolor="orange", opacity=0.15, line_width=0, layer="below"
)
fig.add_vrect(
x0=x1, x1=x_range[1],
fillcolor="pink", opacity=0.15, line_width=0, layer="below"
)
# Add vertical dotted lines at inflection points
for x0 in inflection_points:
fig.add_vline(
x=x0,
line=dict(color="black", width=2, dash="dot")
)
# Add text annotations for the three regions based on the sign of the second derivative
region_xs = [
(x_range[0] + inflection_points[0]) / 2 - 0.75,
(inflection_points[0] + inflection_points[1]) / 2,
(inflection_points[1] + x_range[1]) / 2
]
region_ys = [5, 5, 5]
region_signs = []
for x in region_xs:
val = second_derivative(x)
if val > 0:
region_signs.append("second<br>derivative<br><span style='color:#d81b60'>positive</span>")
elif val < 0:
region_signs.append("second<br>derivative<br><span style='color:orange'>negative</span>")
else:
region_signs.append("second<br>derivative<br>zero")
for x, y, text in zip(region_xs, region_ys, region_signs):
fig.add_annotation(
x=x,
y=y,
text=text,
showarrow=False,
font=dict(size=14, color="black"),
align="center"
)
# Hide x and y axis lines
fig.update_xaxes(showline=False, zeroline=False)
fig.update_yaxes(showline=False, zeroline=False)
# Add happy and sad face emojis in the "cups" (minima) and "bowls" (maxima)
# For this quartic, local minima at x = -2 and x = 1, local maximum at x = 0
minima_xs = [-2, 1]
maxima_xs = [0]
for x in minima_xs:
y = f(x)
fig.add_annotation(
x=x+0.05,
y=y+0.75,
text="😊",
showarrow=False,
font=dict(size=24)
)
for x in maxima_xs:
y = f(x)
fig.add_annotation(
x=x,
y=y-0.75,
text="😢",
showarrow=False,
font=dict(size=24)
)
fig.update_layout(width=600, height=350, showlegend=False, margin=dict(t=5, l=20, r=20, b=0))
fig.show(scale=3, renderer='png')
When the second derivative is positive, the function is concave up, also known as convex. You should think of convex functions as “bowl-shaped” or “smiling”. When the second derivative is negative, the function is concave down, or simply concave; the equivalent analogy is that concave down regions are “upside-down bowls” or “sad faces”.
From the perspective of finding local minima, if a function is concave up at a critical point, then we must be at the bottom of a bowl – a local minimum – and if a function is concave down at a critical point, we must be at the top of a hill, corresponding to a local maximum.
If a function is concave up across its entire domain – unlike in the example above, but like in f(x)=x2 – then any local minimum must be a global minimum. Convexity is a hugely important concept in optimization and machine learning, and we’ll see it again in more detail throughout the course.
The points at which the second derivative is 0 and changes sign are called inflection points. f(x) has two inflection points, marked by vertical dotted lines above, roughly at x ≈ −1.22 and x ≈ 0.55. These are the roots of the quadratic equation $\frac{d^2f}{dx^2} = 3x^2 + 2x - 2 = 0$.
We’ve implicitly used a second derivative test for determining whether a critical point is a local maximum or minimum: if $\frac{d^2f}{dx^2}$ is positive at a critical point, the critical point is a local minimum; if it is negative, the critical point is a local maximum; and if it is 0, the test is inconclusive.
Again, the second derivative test only tells us whether critical points are local maxima or minima; it does not tell us whether they are global maxima or minima.
Let’s look at another example, one where the second derivative test is inconclusive. Consider $f(x) = x^2 \sin(x)$, shown below.
from utils import plot_functions
import numpy as np
f_list = [
    lambda x: (x ** 2) * np.sin(x)
]
f_titles = [
    r'$f(x) = x^2 \sin(x)$'
]
x_range = (-2 * np.pi, 2 * np.pi)
y_range = (-25, 25)
fig = plot_functions(
f_list=f_list,
f_titles=f_titles,
x_range=x_range,
y_range=y_range,
show_axis_labels=True,
xaxis_title='',
yaxis_title=''
)
# Add extra top margin so titles are not cut off
fig.update_layout(width=500, height=350, showlegend=False, margin=dict(t=50, l=20, r=20, b=0))
fig.show(scale=3, renderer='png')
$f(x) = x^2 \sin(x)$, like sin(x), is oscillatory, and has no global extrema; see here for a larger graph of it. Above, we’ve plotted f(x) within the domain $[-2\pi, 2\pi]$, and we see several local maxima and minima.
The first and second derivatives of $f(x) = x^2 \sin(x)$ are given by:

$$\frac{df}{dx} = x^2\cos(x) + 2x\sin(x), \qquad \frac{d^2f}{dx^2} = 4x\cos(x) - (x^2 - 2)\sin(x)$$
Solving for the critical points of f(x) by setting $\frac{df}{dx} = x^2\cos(x) + 2x\sin(x) = 0$ is no easy task, as there are infinitely many solutions, most of which cannot be solved for by hand. We’ll learn how to write code to approximate solutions to $\frac{df}{dx} = 0$ in Chapter 4 of the course, when we study gradient descent. There are also infinitely many inflection points, since $\frac{d^2f}{dx^2} = 0$ has infinitely many solutions, meaning that there are many regions where f(x) is concave up and many others where it is concave down.
However, one critical point is easy to spot: x=0. At x=0, the derivative is 0:
$$\frac{df}{dx}(0) = 0^2\cos(0) + 2 \cdot 0 \cdot \sin(0) = 0$$
x=0 is also an inflection point, since the second derivative is also 0:
$$\frac{d^2f}{dx^2}(0) = 4 \cdot 0 \cdot \cos(0) - (0^2 - 2)\sin(0) = 0$$
To be clear, not every inflection point is a critical point, and not every critical point is an inflection point; x=0 just happens to be both.
If we look at the graph of f(x) near x=0, we’ll see that x=0 corresponds to neither a local maximum nor a local minimum, but rather, a region where f(x) is very flat. If we weren’t able to graph f(x), we could try and determine its behavior around (0,f(0)) by looking at points immediately to the left and right of x=0 – say, (0.001,f(0.001)) and (−0.001,f(−0.001)). If f(0.001)>f(0) and f(−0.001)>f(0), then x=0 would be a local minimum (but that’s not the case here).
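The neighbor-checking idea in the last paragraph is easy to carry out in code. A sketch:

```python
import math

# f'(0) = f''(0) = 0 for f(x) = x^2 sin(x), so compare f at points
# immediately to the left and right of x = 0:
f = lambda x: x**2 * math.sin(x)
print(f(-0.001), f(0), f(0.001))
# f is negative just left of 0 and positive just right of 0, so x = 0 is
# neither a local maximum nor a local minimum.
```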
Activity 5
Activity 5.1
Let $f(x) = x\log(x^2)$, where log(⋅) is the natural logarithm.
Find the critical points of f(x), and determine whether they are local maxima, minima, or neither.
Find the inflection points of f(x), and use them to sketch a possible graph of f(x).
Activity 5.2
Let $g(3) = 10$, $\frac{dg}{dx}(3) = -2$, and $\frac{d^2g}{dx^2}(3) = 1$.
Describe the behavior of g(x) near x=3.
The Taylor series of a function allows us to approximate the value of a function near a point x=a, given the value of the function and its derivatives at x=a. The Taylor series of an arbitrary function f(x) around x=a is given by:

$$f(x) = f(a) + \frac{df}{dx}(a)(x-a) + \frac{1}{2!}\frac{d^2f}{dx^2}(a)(x-a)^2 + \frac{1}{3!}\frac{d^3f}{dx^3}(a)(x-a)^3 + \cdots$$

Note that this is an infinite series; the more terms we use, the more accurate the approximation.
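To see how truncation affects accuracy, here is a sketch using $e^x$ around a = 0, chosen because every derivative of $e^x$ at 0 equals 1 (this example is not from the notes):

```python
import math

# Truncated Taylor series of e^x around a = 0:
# e^x ≈ 1 + x + x^2/2! + x^3/3! + ... (n_terms terms total)
def taylor_exp(x, n_terms):
    return sum(x**k / math.factorial(k) for k in range(n_terms))

for n in [1, 2, 4, 8]:
    print(n, taylor_exp(0.5, n), math.exp(0.5))
# the approximation approaches math.exp(0.5) as n grows
```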
Use the Taylor series to approximate the value of g(3.1), using only the information provided. You’ll only be able to use the first 3 terms of the Taylor series.
Activity 5.3
Given that $\frac{d^2h}{dx^2} = 2x(x-3)(x+1)$, sketch a possible graph of h(x).
Finally, I’ll remark that we’ve presented derivatives, extrema, and optimization all in the most ideal setting: where the functions we’re working with are continuous and differentiable. A function is continuous if its graph can be drawn without lifting a pen; any point where the graph has a “jump” or “break” is a discontinuity. (Of course, there is a more formal definition of continuity, but this is a good enough illustration for now.) A function is differentiable if its derivative exists everywhere; otherwise, there exist some points at which the derivative does not exist.
Most relevant functions in machine learning are continuous, but non-differentiable functions do appear, so it’s worth understanding what they are and how to deal with them. Let’s look at a few examples.
$f(x) = |x|$ is continuous everywhere, as intuitively, we can draw its graph without lifting our pen. It is differentiable everywhere except at x=0; the reason it is not differentiable at x=0 is that the slopes approaching it from the left (−1) and right (+1) are different, and in order for the derivative at x=a to exist, the limit of the slopes approaching a from the left and from the right must be the same.
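The left and right slopes can be seen directly with one-sided difference quotients. A sketch:

```python
# One-sided difference quotients of f(x) = |x| at x = 0:
f = lambda x: abs(x)
h = 1e-6
print((f(0 + h) - f(0)) / h)     # slope from the right: 1.0
print((f(0 - h) - f(0)) / (-h))  # slope from the left: -1.0
# The one-sided slopes disagree, so f'(0) does not exist.
```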
$$h(x) = \begin{cases} x^2 + \frac{1}{2}x + 1 & x < 0 \\ \frac{1}{2}x + 1 & x \geq 0 \end{cases}$$

h(x) is continuous and differentiable everywhere, despite being a piecewise function. Its individual pieces are continuous, and the entire function is continuous because the “left” and “right” pieces have the same value, 1, at the connection point x=0. Its derivative is also piecewise:

$$\frac{dh}{dx} = \begin{cases} 2x + \frac{1}{2} & x < 0 \\ \frac{1}{2} & x \geq 0 \end{cases}$$

Since the two piecewise derivatives agree at x=0 (both equal $\frac{1}{2}$ there), h(x) is differentiable at x=0 (and across its entire domain).
Example 4: $k(x) = \begin{cases} -(x+2)^2 + 5 & x < -2 \\ \frac{1}{2}x + 6 & x \geq -2, \; x \neq 4 \\ \text{undefined} & x = 4 \end{cases}$

k(x) is continuous everywhere, except at x=4, where its graph has a break and it is neither continuous nor differentiable. But in addition, k(x) is not differentiable at x=−2, because the slopes approaching x=−2 from the left and right are different.
An important point is that any function that is differentiable everywhere is also continuous everywhere; differentiability is a stronger condition than continuity. Plenty of functions are continuous but not differentiable, like $f(x) = |x|$ in Example 1.