ML sketches

Saturday, January 24, 2026

Self Attention

x → Embedding → MultiHeadAttention → Concat → Project to lower dim →

→ Add(x) → LayerNorm → FFN → Add → LayerNorm

Vocab to embedding

torch.nn.Embedding(Vocab, embed_dim)

Batch X Seq Len X Vocab → Batch X Seq Len X embed_dim

PE = Batch X Seq Len X embed_dim

Self Attention- K, Q matrix, & Attention weight Score

Self Attention- Attention weight softmax example

Self Attention- Attention weighted features
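As a minimal sketch of the three steps named above (attention scores, softmax weights, weighted features), the following plain-Python code runs scaled dot-product self-attention on a toy sequence. It assumes identity projections, so the toy input x is used directly as Q, K and V; the helper names are illustrative and do not come from any library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) times (m x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                      # Seq Len x Seq Len
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]            # each row sums to 1
    return weights, matmul(weights, v)                    # Seq Len x embed_dim

# Toy sequence of 3 tokens with embed_dim = 2 (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, features = self_attention(x, x, x)
```

Each row of weights says how much one token attends to every token in the sequence, and features holds the corresponding weighted averages of the value vectors.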

Linear Attention

Monday, January 19, 2026

Matrix Multiplication Time complexity

Monday, November 27, 2023

Sampling via Inverse Transformation - an example
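One possible example, assuming the target is an exponential distribution with rate mu (an assumption, chosen because its CDF inverts in closed form): draw u from Uniform[0, 1) and return F^-1(u) = -ln(1 - u)/mu. The function name and the value of mu are illustrative.

```python
import math
import random

def sample_exponential(mu, rng):
    # Inverse of the CDF F(x) = 1 - exp(-mu * x), applied to a uniform draw.
    u = rng.random()
    return -math.log(1.0 - u) / mu

rng = random.Random(0)
mu = 2.0  # assumed rate, so the theoretical mean is 1 / mu = 0.5
samples = [sample_exponential(mu, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
```

The sample mean should land close to 1/mu, which is a quick sanity check that the transformed uniforms follow the intended distribution.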

Tuesday, January 18, 2022

Yet another probability Puzzle - HHT vs HTT

Flip a coin until either HHT or HTT appears. Is one more likely to appear first? If so, which one and with what probability ?

Both sequences can only start once the first Heads appears.

From the diagram we can see that more paths lead to HHT than to HTT. The probabilities can be calculated from the diagram as

p(HHT | H) = 0.25 + 0.25[0.5 + 0.5² + 0.5³ + 0.5⁴ ..... ] + 0.25*p(HHT | H)

p(HHT | H) = 0.5/0.75

p(HTT | H) = 0.25 + 0.25*p(HTT | H)

p(HTT | H) = 0.25/0.75

So HHT is twice as likely as HTT to appear first: 2/3 versus 1/3.
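The 2/3 versus 1/3 split can be checked with a quick Monte Carlo run. This is a sketch under the assumption of a fair coin; the function name, seed and number of trials are illustrative.

```python
import random

def first_pattern(rng):
    # Flip a fair coin until the last three flips spell HHT or HTT.
    last3 = ""
    while True:
        last3 = (last3 + rng.choice("HT"))[-3:]
        if last3 == "HHT":
            return "HHT"
        if last3 == "HTT":
            return "HTT"

rng = random.Random(42)
trials = 100_000
hht_wins = sum(first_pattern(rng) == "HHT" for _ in range(trials))
hht_share = hht_wins / trials  # should be close to 2/3
```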

Simulating the above transition matrix/ Markov Chain:

State indices: H --> 0, HH --> 1, HHT (absorbing) --> 2, HTT (absorbing) --> 3

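import numpy as np

# Column-stochastic transition matrix: column j holds the probabilities of leaving state j.
A = np.array([[0.25, 0.0, 0.0, 0.0],
              [0.25, 0.5, 0.0, 0.0],
              [0.25, 0.5, 1.0, 0.0],
              [0.25, 0.0, 0.0, 1.0]])

x = np.array([1, 0, 0, 0])  # start in state 0 (H) with probability 1
for i in range(10):
    print(x)
    # Rounding at every step accumulates a small error in the limiting values.
    x = np.round(np.matmul(A, x), 2)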

Output>> [1 0 0 0]

[0.25 (0.25) 0.25 (0.25)]

[0.06 (0.19) 0.44 (0.31)]

[0.02 (0.11) 0.55 (0.32)]

[0.   (0.06) 0.61 (0.32)]

[0.   (0.03) 0.64 (0.32)]

[0.   (0.02) 0.66 (0.32)]

[0.   (0.01) 0.67 (0.32)]

[0.   (0.  ) 0.68 (0.32)]

[0.   (0.  ) 0.68 (0.32)]
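The printed values settle at 0.68 and 0.32 rather than 2/3 and 1/3 because the state vector is rounded to two decimals at every step. A sketch of the same iteration with exact fractions (standard library only, same matrix as above) shows the unrounded limit; the step helper is an illustrative name.

```python
from fractions import Fraction as F

# Same column-stochastic matrix as above, with exact rational entries.
A = [[F(1, 4), 0,       0, 0],
     [F(1, 4), F(1, 2), 0, 0],
     [F(1, 4), F(1, 2), 1, 0],
     [F(1, 4), 0,       0, 1]]

def step(A, x):
    # One matrix-vector product A @ x.
    return [sum(a * v for a, v in zip(row, x)) for row in A]

x = [F(1), F(0), F(0), F(0)]
for _ in range(60):
    x = step(A, x)

p_state2, p_state3 = float(x[2]), float(x[3])  # approach 2/3 and 1/3
```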

Sunday, January 16, 2022

Basics of Hypothesis testing

Parametric

α = Significance level = p(H0 is rejected | H0 is true) = probability of making a type-1 error when rejecting at this α. Before doing the statistical test, one writes down this number as a commitment: I am OK with an α chance of rejecting the null hypothesis H0 even when H0 should not have been rejected.

p-value = p(observing the sampled parameter estimate, or one more extreme | H0 is true)

β = p(failing to reject H0 | Ha is true) = probability of making a type-2 error

1-β = Power = p(rejecting H0 | Ha is true)

Power increases with a larger sample size, a larger difference to be observed (effect size), and a larger α.
Take the α-percentile critical value under the null hypothesis, then, based on the effect size to be observed, decide on n (which sets the shape of the sampling distribution).
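To see how sample size, effect size and the significance level move power, here is a small sketch for a one-sided z-test with known standard deviation. The test setup (H0: mean = 0 against Ha: mean = effect > 0), the function name and the numbers are assumptions made for illustration.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha):
    # Power of a one-sided z-test of H0: mean = 0 vs Ha: mean = effect > 0.
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)       # critical value under H0
    shift = effect * n ** 0.5 / sigma   # mean of the test statistic under Ha
    return 1 - z.cdf(z_crit - shift)    # p(rejecting H0 | Ha is true)

power_small_n = z_test_power(effect=0.5, sigma=1.0, n=25, alpha=0.05)
power_large_n = z_test_power(effect=0.5, sigma=1.0, n=100, alpha=0.05)
power_loose_alpha = z_test_power(effect=0.5, sigma=1.0, n=25, alpha=0.10)
```

With zero effect the function returns the significance level itself, which matches the definition of the type-1 error rate.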

Confidence Interval (say 95%) = if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true parameter
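In the frequentist reading, the 95% describes the interval-building procedure over repeated samples. A small coverage simulation can illustrate this; it assumes normal data with a known sigma, and all parameter values are made up for the example.

```python
import random
from statistics import NormalDist

rng = random.Random(1)
true_mean, sigma, n, runs = 10.0, 2.0, 30, 5000  # assumed values
z = NormalDist().inv_cdf(0.975)
half_width = z * sigma / n ** 0.5

covered = 0
for _ in range(runs):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    m = sum(sample) / n
    # Count how often the interval around the sample mean contains the truth.
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

coverage = covered / runs  # should be close to 0.95
```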

Bootstrap confidence Interval

x1, x2, x3, ..., xn is a data sample drawn from distribution F
For each bootstrap resample x* (n values drawn from the sample with replacement), calculate δ* = avg(x*) - avg(x)
Find the required lower and upper quantiles of δ* (say 2.5% and 97.5%) and make a range around avg(x)
[avg(x) - δ*_upper , avg(x) - δ*_lower]
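The steps above can be sketched as follows. This is one common variant (the empirical bootstrap for the mean), not necessarily the exact recipe intended in the notes; the function name, sample values and number of resamples are illustrative.

```python
import random

def bootstrap_ci(sample, level=0.95, n_boot=5000, seed=0):
    rng = random.Random(seed)
    n = len(sample)
    xbar = sum(sample) / n
    # delta* = mean of a resample (drawn with replacement) minus the sample mean
    deltas = sorted(
        sum(rng.choices(sample, k=n)) / n - xbar for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower_q = deltas[int(tail * n_boot)]
    upper_q = deltas[int((1 - tail) * n_boot) - 1]
    # Subtracting the upper quantile gives the lower end, and vice versa.
    return xbar - upper_q, xbar - lower_q

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]  # made-up data
low, high = bootstrap_ci(sample)
```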

Bayesian Hypothesis Testing

P(H0 | Y=y) > P(H1 | Y=y)
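As a toy illustration of comparing the two posteriors, consider a coin with two point hypotheses. The values p0 = 0.5 and p1 = 0.7 and the equal prior are assumptions chosen for the example.

```python
from math import comb

def posterior_h0(k, n, p0=0.5, p1=0.7, prior_h0=0.5):
    # Binomial likelihood of k heads in n tosses under each hypothesis.
    like0 = comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
    like1 = comb(n, k) * p1 ** k * (1 - p1) ** (n - k)
    evidence = prior_h0 * like0 + (1 - prior_h0) * like1
    return prior_h0 * like0 / evidence  # P(H0 | Y = y) by Bayes' rule

# H0 is preferred when its posterior exceeds 0.5, i.e. P(H0 | y) > P(H1 | y).
prefer_h0_balanced = posterior_h0(k=5, n=10) > 0.5
prefer_h0_lopsided = posterior_h0(k=9, n=10) > 0.5
```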

Tuesday, January 11, 2022

Zero Sum Game

Red Bus – Blue Bus problem

The red bus - blue bus problem states that demand transfer happens in proportion to the attributes of the products, as different attributes offer different utility to customers.

This demand transfer can also be seen as a zero-sum game, e.g. customer share shifting from one product to another.

Thus a choice model can be created by modelling the choice probability, using a probit or logit model on the attributes of the products, to understand which attributes influence the customer and by how much.
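A short sketch of how a plain multinomial logit shifts share when a blue bus is added next to an otherwise identical red bus. The utilities are assumed equal for illustration; the point is that the model keeps the car to red bus ratio fixed, so share is taken from every existing option in proportion.

```python
import math

def logit_shares(utilities):
    # Multinomial logit: share_i = exp(u_i) / sum_j exp(u_j)
    exps = {name: math.exp(u) for name, u in utilities.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

before = logit_shares({"car": 0.0, "red bus": 0.0})
after = logit_shares({"car": 0.0, "red bus": 0.0, "blue bus": 0.0})

ratio_before = before["car"] / before["red bus"]
ratio_after = after["car"] / after["red bus"]
```

Here the car share falls from 1/2 to 1/3 even though the new bus differs from the old one only in colour. The total share always sums to 1, which is the zero-sum view of demand transfer.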

Utility Theory:

Thursday, January 6, 2022

Coxian Distribution

Coxian Distribution can be used to model:

Service time at a service center that provides a series of services, with the option of saying yes or no to continuing at each stage.

The basic building block of a Coxian distribution is the

Exponential Distribution

and its pdf is given as

f(x) = μe^(-μx) ; x >= 0

It is generally used to model the time span between 2 events that arrive according to a Poisson process, e.g. the time between 2 calls in a call center, or the time it takes to cut hair in a barber shop.

Now, if 2 tasks whose service times/dwell times are exponentially distributed are placed in

1. Series, they are called Hypo-exponential

e.g.

μ_1(hair-cut) ----> μ_2(shampoo)

X1 ------> X2

Now if we need to find the distribution of the sum of 2 independent random variables, we do a convolution

pdf(X1 + X2 = x) = ∫_{-∞}^{∞} μ_1 e^(-μ_1 t) * μ_2 e^(-μ_2 (x-t)) dt

Hint: 1. t + (x-t) = x, for all possible values of "t" from -infinity to infinity

2. (x-t) in the second exponent must be greater than or equal to 0, and so must "t", thus "t" is limited to values between 0 and x

Hypo-exponential p(x) = ......................
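Carrying out the integral over t from 0 to x gives, for μ_1 ≠ μ_2, the closed form μ_1 μ_2/(μ_2 - μ_1) * (e^(-μ_1 x) - e^(-μ_2 x)). The sketch below compares that expression with a direct numerical evaluation of the convolution; the rates, the evaluation point and the function names are assumed for illustration.

```python
import math

def hypoexp_pdf(x, mu1, mu2):
    # Closed form of the convolution, valid when mu1 != mu2.
    return mu1 * mu2 / (mu2 - mu1) * (math.exp(-mu1 * x) - math.exp(-mu2 * x))

def hypoexp_pdf_numeric(x, mu1, mu2, steps=20_000):
    # Midpoint-rule integral of mu1*exp(-mu1*t) * mu2*exp(-mu2*(x - t)) over t in [0, x].
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += mu1 * math.exp(-mu1 * t) * mu2 * math.exp(-mu2 * (x - t))
    return total * h

closed = hypoexp_pdf(1.5, mu1=1.0, mu2=3.0)
numeric = hypoexp_pdf_numeric(1.5, mu1=1.0, mu2=3.0)
```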

2. Parallel, they are called Hyper-exponential

{

Imagine there is a bag and there are 2 coins {A,B} inside it. An experimenter randomly picks a coin and tosses it to observe {Heads, Tails}.

So the probability of observing Heads can be written as

P(outcome = Head) = (probability of selecting coin A)(probability of getting head in coin A) + (probability of selecting coin B)(probability of getting head in coin B)

}

to be continued

Wednesday, December 29, 2021

Estimate parameters of Negative Binomial Distribution - NB?

NB = Number of trials (e.g. coin tosses) "n" required for k successes (heads of the coin)

p = probability of success (fairness of the coin), which is to be estimated

{

pmf Derivation

The trials end on observing the k_th success. So the n_th trial has the k_th success.

So the first n-1 trials have k-1 successes, and those successes can be distributed in all possible ways among the n-1 positions, i.e.

C(n-1, k-1) ways (n-1 choose k-1), each way/pattern having probability p^(k-1) (1-p)^(n-k)

And we know that the probability of success at the n_th position is "p".

The first n-1 trials and the n_th trial are independent, so we can multiply their probabilities to get the joint distribution

PMF = p * C(n-1, k-1) p^(k-1) (1-p)^(n-k)

= C(n-1, k-1) p^k (1-p)^(n-k)

----------------------------------------------------------------------

Compounding probability distribution

Let's say p ~ Beta(α, β), i.e. the parameter p is itself a random variable

P(no of trials = n, prob of success = p)

= NB_pmf(n, k, p) * Beta_pdf(p | α, β)

For each value of "p" between 0 and 1, we calculate the above probability and sum (integrate) it up, and thus get rid of the variable p. [Posterior predictive distribution]

Beta Compounded Negative Binomial PMF = ∫_0^1 NB_pmf(n, k, p) * Beta_pdf(p | α, β) dp

BNB_pmf = ∫_0^1 C(n-1, k-1) p^k (1-p)^(n-k) * p^(α-1) (1-p)^(β-1) / B(α, β) dp

= C(n-1, k-1)/B(α, β) ∫_0^1 p^(k+α-1) (1-p)^(n-k+β-1) dp

= C(n-1, k-1) B(k+α, β+n-k)/B(α, β)

}
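Both formulas in the derivation can be checked numerically. In this sketch the NB pmf is summed over n to confirm it adds up to 1, and the Beta compounded pmf is compared with a direct numerical integration over p. The function names and parameter values are illustrative.

```python
from math import comb, exp, lgamma

def nb_pmf(n, k, p):
    # Probability that the k-th success arrives on trial n.
    return comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

def log_beta(a, b):
    # log B(a, b) via log-gamma, for numerical stability.
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bnb_pmf(n, k, alpha, beta):
    # Closed form: C(n-1, k-1) * B(k + alpha, beta + n - k) / B(alpha, beta)
    return comb(n - 1, k - 1) * exp(log_beta(k + alpha, beta + n - k) - log_beta(alpha, beta))

def bnb_pmf_numeric(n, k, alpha, beta, steps=20_000):
    # Midpoint-rule integral of NB_pmf(n, k, p) * Beta_pdf(p | alpha, beta) over p in (0, 1).
    h = 1.0 / steps
    norm = exp(log_beta(alpha, beta))
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += nb_pmf(n, k, p) * p ** (alpha - 1) * (1 - p) ** (beta - 1) / norm
    return total * h

nb_total = sum(nb_pmf(n, 3, 0.4) for n in range(3, 400))
closed = bnb_pmf(7, 3, 2.0, 3.0)
numeric = bnb_pmf_numeric(7, 3, 2.0, 3.0)
```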

Now assume the following sample is observed for "n", i.e. the number of trials needed to achieve k successes, and we want to estimate p, the probability of success

[n_1, n_2, n_3, n_4, ..., n_m]

pmf_1 = C(n_1 - 1, k-1) p^k (1-p)^(n_1 - k)

The joint distribution of all the observed samples, which are independent, is the product of the individual pmfs

L = ∏_{i=1}^{m} pmf_i

Take the log to make taking the derivative simple,

LL = Σ_{i=1}^{m} log(pmf_i)

= Σ_{i=1}^{m} log(C(n_i - 1, k-1) p^k (1-p)^(n_i - k))

To maximize the log likelihood, take the derivative with respect to p and equate it to zero

∇LL = mk/p - (Σn_i - mk)/(1-p) = 0

p = mk/Σn_i
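A quick simulation can confirm that mk/Σn_i recovers the success probability. The true p, k, sample size m and seed below are assumed values for the sketch.

```python
import random

def trials_until_k_successes(k, p, rng):
    # Run Bernoulli(p) trials until the k-th success and return the trial count n.
    n = successes = 0
    while successes < k:
        n += 1
        if rng.random() < p:
            successes += 1
    return n

rng = random.Random(7)
k, true_p, m = 3, 0.3, 20_000
ns = [trials_until_k_successes(k, true_p, rng) for _ in range(m)]
p_hat = m * k / sum(ns)  # maximum likelihood estimate of p
```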

Thursday, December 9, 2021

Estimate parameters of Binomial Distribution ?

Let's assume we observe data as the following sample:

[X1, X2, X3, X4 ......Xm]

Where each X_i is a vector of trial outcomes, e.g. [0011111000...], whose size, i.e. the number of trials, is fixed at "n", and the number of successes observed in each is [k1, k2, k3 ..... km]

Now we know that pmf(n,p,k) = nCk p^k (1-p)^(n-k)

{

pmf derivation

k successes out of n trials can occur in nCk ways/patterns.

And each of those patterns can happen with probability p^k (1-p)^(n-k)

nCk p^k (1-p)^(n-k)

}

So, likelihood function of observing the data i.e. joint probability can be written as

L = (pmf-1)(pmf-2)(pmf-3).......(pmf-m)

Log L = log(pmf-1) + log(pmf-2) + log(pmf-3).......log(pmf-m)

= K + [k1 + k2 + k3 + ... km] log(p) + [mn - (k1 + k2 + k3 + ... km)] log(1-p), where K = Σ log(nCk_i) does not depend on p

Gradient for maximizing likelihood

∇LL = [k1 + k2 + k3 + ... km]/p - [mn -(k1 + k2 + k3 + ... km)]/(1-p) = 0

p̂ = [k1 + k2 + k3 + ... km]/(mn)
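The estimate can be checked on simulated data: generate m binomial counts, compute the total successes over mn, and confirm that the log likelihood (with the constant K dropped) is no higher at nearby values of p. All parameter values are assumed for the sketch.

```python
import math
import random

rng = random.Random(3)
n, true_p, m = 20, 0.35, 5000  # assumed values
# Each k_i is the number of successes in n Bernoulli(true_p) trials.
ks = [sum(rng.random() < true_p for _ in range(n)) for _ in range(m)]

p_hat = sum(ks) / (m * n)  # maximum likelihood estimate

def log_lik(p):
    # Log likelihood without the constant K, which does not depend on p.
    return sum(ks) * math.log(p) + (m * n - sum(ks)) * math.log(1 - p)
```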

Tuesday, December 7, 2021

Is the coin fair ? given that sample observed X = [..............]

1st Method CLT:

We calculate E[X] = proportion of successes

H0: proportion = 0.5 (fair coin)

Calculate the p-value with a one-sample proportion test (normal approximation).

2nd Method Binomial pmf:

Pmf = Bin(n_trials, k_success, p=0.5 (fair coin null hypothesis))

CDF

If CDF(less than or equal to k) < alpha (significance level), reject the null hypothesis (a one-sided, lower-tail test)
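Both methods can be sketched with the standard library. The observed counts (38 heads in 100 tosses) and alpha are made-up values for illustration, and the proportion test is written here as a two-sided z-test, which is one common choice.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p=0.5):
    # P(X <= k) for X ~ Binomial(n, p): the exact lower-tail probability.
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def z_test_p_value(k, n, p0=0.5):
    # Two-sided p-value from the normal (CLT) approximation to the sample proportion.
    z = (k / n - p0) / sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

n, k, alpha = 100, 38, 0.05  # hypothetical observation
reject_by_binomial = binom_cdf(k, n) < alpha
reject_by_clt = z_test_p_value(k, n) < alpha
```

For this hypothetical sample both methods reject the fair-coin null at the 5% level.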

Unboxing blackbox logistic regression(MLE)

Imagine we have a blackbox executable of logistic regression and the 2 hyperparameters tuned are regularisation and probability threshold.

How can we extract the Beta coefficients of the model ?
