Just some ML algorithms to refresh my memory, without all the fluff
Objective: $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, where
\[J(\boldsymbol{\theta}) = \frac{1}{2n} \left\| \mathbf{y} - \mathbf{X} \boldsymbol{\theta} \right\|^2\]
Expanding, \(J(\boldsymbol{\theta}) = \frac{1}{2n} \left[ \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top \mathbf{X} \boldsymbol{\theta} + \boldsymbol{\theta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\theta} \right]\)
Gradient, \(\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{n} \left( \mathbf{X}^\top \mathbf{X} \boldsymbol{\theta} - \mathbf{X}^\top \mathbf{y} \right)\)
Setting gradient to zero, \(\mathbf{X}^\top \mathbf{X} \boldsymbol{\theta} = \mathbf{X}^\top \mathbf{y}\)
Solution: \(\boxed{\boldsymbol{\theta} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}}\)
import numpy as np

theta = np.linalg.inv(X.T @ X) @ X.T @ y  # normal-equation solution
# theta = np.linalg.pinv(X) @ y  # pseudoinverse variant, works even when X.T @ X is singular
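Quick sanity check on synthetic data (a minimal sketch; X_demo, y_demo, and the noise scale are made up for illustration). The pseudoinverse estimate should match np.linalg.lstsq:

rng = np.random.default_rng(0)
n = 100
X_demo = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])   # bias column + 3 features
y_demo = X_demo @ rng.normal(size=(4, 1)) + 0.01 * rng.normal(size=(n, 1))

theta_hat = np.linalg.pinv(X_demo) @ y_demo
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))   # expect True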
X.T @ X isn't always invertible, hence use the pseudoinverse instead; it is also usually more numerically stable. This mirrors the 1-D slope formula Cov(X, y) / Var(X): X.T @ y is the covariance term and X.T @ X the variance term (up to the common 1/n factor).
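Minimal sketch of the 1-D case (illustrative data; x is centered so the no-intercept form applies):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
x = x - x.mean()                              # center x so the no-intercept slope formula applies
y1 = 2.5 * x + 0.1 * rng.normal(size=200)

slope_normal_eq = (x @ y1) / (x @ x)                       # (X^T X)^{-1} X^T y in one dimension
slope_cov_var = np.cov(x, y1)[0, 1] / np.var(x, ddof=1)    # Cov(x, y) / Var(x)
print(np.isclose(slope_normal_eq, slope_cov_var))          # expect True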
Gradient descent, same objective: $\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, \(J(\boldsymbol{\theta}) = \frac{1}{2n} \left\| \mathbf{y} - \mathbf{X} \boldsymbol{\theta} \right\|^2\)
Gradient: \(\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{n} \mathbf{X}^\top \left( \mathbf{X} \boldsymbol{\theta} - \mathbf{y} \right)\)
Gradient Descent Update: \(\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \alpha \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(i)})\)
Update Rule: \(\boxed{ \boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \frac{\alpha}{n} \mathbf{X}^\top \left( \mathbf{X} \boldsymbol{\theta}^{(i)} - \mathbf{y} \right) }\)
where $\alpha$ is the learning rate and $i$ is the iteration number.
num_rows, num_cols = X.shape
## if X has no bias column, prepend one and account for the extra parameter:
## X = np.hstack([np.ones((num_rows, 1)), X]); num_cols += 1
theta = np.zeros((num_cols, 1))
for i in range(iterations):
    theta -= (alpha / num_rows) * X.T @ (X @ theta - y)  # batch gradient step
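End-to-end sketch on synthetic data (X_demo, alpha, and iterations are arbitrary illustrative choices); the iterates should approach the closed-form solution:

import numpy as np

rng = np.random.default_rng(2)
n = 200
X_demo = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])   # bias + 2 features
y_demo = X_demo @ np.array([[1.0], [2.0], [-3.0]]) + 0.05 * rng.normal(size=(n, 1))

alpha, iterations = 0.1, 5000
theta_gd = np.zeros((X_demo.shape[1], 1))
for _ in range(iterations):
    theta_gd -= (alpha / n) * X_demo.T @ (X_demo @ theta_gd - y_demo)

print(np.allclose(theta_gd, np.linalg.pinv(X_demo) @ y_demo, atol=1e-3))   # should print True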
For classification, transform X @ theta by applying the sigmoid function, which turns the linear scores into probabilities. The loss changes from mean squared error to cross-entropy.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cross_entropy_loss(y, y_pred):
    # mean binary cross-entropy; y_pred are probabilities in (0, 1)
    return (-y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)).mean()

z = X @ theta          # linear scores (logits)
y_pred = sigmoid(z)    # probabilities
loss = cross_entropy_loss(y, y_pred)
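The snippet above only evaluates the loss. A minimal training sketch (synthetic data; alpha and iterations are arbitrary), using the cross-entropy gradient, which takes the same X.T @ (y_pred - y) / n form as in the linear case:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(3)
n = 500
X_demo = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
true_theta = np.array([[0.5], [2.0], [-1.5]])
y_demo = (rng.uniform(size=(n, 1)) < sigmoid(X_demo @ true_theta)).astype(float)  # Bernoulli labels

alpha, iterations = 0.5, 2000
theta_lr = np.zeros((X_demo.shape[1], 1))
for _ in range(iterations):
    grad = X_demo.T @ (sigmoid(X_demo @ theta_lr) - y_demo) / n   # gradient of mean cross-entropy
    theta_lr -= alpha * grad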
# y_true and y_pred are 0/1 label arrays, e.g. y_pred = (sigmoid(z) >= 0.5).astype(int)
tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = (2 * precision * recall) / (precision + recall)
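Quick worked check on a made-up toy example:

import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])
# tp = 3, fp = 1, fn = 1  ->  precision = 0.75, recall = 0.75, f1 = 0.75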
tprs, fprs = [], []
for threshold in np.arange(0, 1.1, 0.1):
    y_pred = (predicted_probs >= threshold).astype(int)  # binarize at this threshold
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tpr = tp / (tp + fn)   # true positive rate (recall)
    fpr = fp / (fp + tn)   # false positive rate
    tprs.append(tpr)
    fprs.append(fpr)
def compute_auc_roc(tprs, fprs):
    # trapezoidal rule over the ROC points (assumes the points trace the curve in order)
    aucroc = 0
    for i in range(1, len(tprs)):
        aucroc += 0.5 * abs(fprs[i] - fprs[i-1]) * (tprs[i] + tprs[i-1])
    return aucroc
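Tying the threshold sweep and compute_auc_roc together on made-up scores (y_true and predicted_probs are illustrative); the commented-out sklearn call is an optional cross-check, though the coarse 0.1 threshold grid means the two values will only roughly agree:

import numpy as np

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, size=1000)
predicted_probs = np.clip(0.3 * y_true + rng.normal(0.35, 0.2, size=1000), 0, 1)  # noisy scores

tprs, fprs = [], []
for threshold in np.arange(0, 1.1, 0.1):
    y_pred = (predicted_probs >= threshold).astype(int)
    tprs.append(np.sum((y_true == 1) & (y_pred == 1)) / np.sum(y_true == 1))   # tp / (tp + fn)
    fprs.append(np.sum((y_true == 0) & (y_pred == 1)) / np.sum(y_true == 0))   # fp / (fp + tn)

print(compute_auc_roc(tprs, fprs))
# from sklearn.metrics import roc_auc_score
# print(roc_auc_score(y_true, predicted_probs))   # optional cross-check if scikit-learn is installed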