Saturday, January 24, 2026

Self Attention

 

x → Embedding → MultiHeadAttention → Concat → Project to lower dim

→ Add(x) → LayerNorm → FFN → Add → LayerNorm



Vocab to embedding




























torch.nn.Embedding(vocab_size, embed_dim)


Batch X Seq Len X Vocab (one-hot)  → Batch X Seq Len X embed_dim
(nn.Embedding takes the equivalent integer token ids of shape Batch X Seq Len as input)
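
A minimal NumPy sketch of the shape change above, assuming a random embedding table and made-up token ids (the names `table` and `token_ids` are illustrative only). It shows that multiplying a one-hot tensor by the table gives the same result as looking rows up by token id.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)

# Hypothetical embedding table: one row of size embed_dim per vocabulary entry
table = rng.normal(size=(vocab_size, embed_dim))

# Batch X Seq Len integer token ids (made-up values)
token_ids = np.array([[1, 5, 7], [2, 2, 9]])

# Batch X Seq Len X Vocab one-hot view of the same tokens
one_hot = np.eye(vocab_size)[token_ids]

via_matmul = one_hot @ table   # Batch X Seq Len X embed_dim
via_lookup = table[token_ids]  # same result, without building the one-hot tensor

print(via_matmul.shape, np.allclose(via_matmul, via_lookup))
```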


PE (positional encoding) = Batch X Seq Len X embed_dim



Self Attention- K, Q matrix, & Attention weight Score


Self Attention- Attention weight softmax example



Self Attention- Attention weighted features
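
The three steps named in the headings above (attention scores from Q and K, softmax weights, weighted features) can be sketched in NumPy as follows. This is a single-head illustration under assumptions of my own: random projection matrices `w_q`, `w_k`, `w_v`, no masking, and the usual scaling by the square root of the key dimension.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self attention (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Attention scores: Batch X Seq Len X Seq Len
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(k.shape[-1])
    # Softmax over the last axis, shifted by the row max for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Attention weighted features: Batch X Seq Len X embed_dim
    return weights @ v, weights

rng = np.random.default_rng(0)
batch, seq_len, embed_dim = 2, 4, 8
x = rng.normal(size=(batch, seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every output position is a weighted average of the value vectors.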


Linear Attention


Tuesday, January 18, 2022

Yet another probability Puzzle - HHT vs HTT

Flip a coin until either HHT or HTT appears. Is one more likely to appear first? If so, which one and with what probability ?

Both sequences can only start once the first Heads appears, so we condition on being in state H.
From the diagram we can see that there are more ways of reaching HHT than HTT. The probabilities can be calculated from the diagram as

p(HHT | H) = 0.25 + 0.25[0.5 + 0.5^2 + 0.5^3 + 0.5^4 + ...] + 0.25*p(HHT | H)

(the bracketed geometric series sums to 1, so p(HHT | H) = 0.5 + 0.25*p(HHT | H))

p(HHT | H) = 0.5/0.75 = 2/3

p(HTT | H) = 0.25 + 0.25*p(HTT | H)

p(HTT | H) = 0.25/0.75 = 1/3

So HHT is twice as likely as HTT to appear first.
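
The 2/3 versus 1/3 split can be sanity-checked by flipping a simulated fair coin directly. This is a rough Monte Carlo sketch; the seed and the number of trials are arbitrary choices.

```python
import random

def first_pattern(rng):
    """Flip a fair coin until HHT or HTT appears; return whichever came first."""
    last3 = ""
    while True:
        last3 = (last3 + rng.choice("HT"))[-3:]  # keep only the last three flips
        if last3 in ("HHT", "HTT"):
            return last3

rng = random.Random(0)
trials = 20000
hht_share = sum(first_pattern(rng) == "HHT" for _ in range(trials)) / trials
print(hht_share)  # should land near 2/3
```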



Simulating the above transition matrix/ Markov Chain:

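States, labelled by the flips that follow the first H: H --> 0, HH --> 1, HT --> 2 (HHT reached, absorbing), TT --> 3 (HTT reached, absorbing). Column j of A holds the probabilities of moving out of state j.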

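import numpy as np

# Column-stochastic transition matrix: A[i, j] = P(next state i | current state j)
A = np.array([[0.25, 0.0, 0.0, 0.0],
              [0.25, 0.5, 0.0, 0.0],
              [0.25, 0.5, 1.0, 0.0],
              [0.25, 0.0, 0.0, 1.0]])

x = np.array([1, 0, 0, 0])  # start in state 0 (H)
for i in range(10):
    print(x)
    # rounding at every step accumulates a small error in the limiting values
    x = np.round(np.matmul(A, x), 2)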
    
Output>> [1 0 0 0]
[0.25 (0.25) 0.25 (0.25)]
[0.06 (0.19) 0.44 (0.31)]
[0.02 (0.11) 0.55 (0.32)]
[0.   (0.06) 0.61 (0.32)]
[0.   (0.03) 0.64 (0.32)]
[0.   (0.02) 0.66 (0.32)]
[0.   (0.01) 0.67 (0.32)]
[0.   (0.  ) 0.68 (0.32)]
[0.   (0.  ) 0.68 (0.32)]

Sunday, January 16, 2022

Basics of Hypothesis testing

 

Parametric

α = Significance level = p(H0 is rejected | H0 is true) = probability of making a type-1 error when rejecting at this α. Before doing the statistical test, one writes down this number as: I am OK with an α percent chance of rejecting the null hypothesis H0 even when H0 should not have been rejected.

p-value = p(observing the sampled parameter estimate or something more extreme | H0 is true)

β = p(failing to reject H0 | Ha is true) = probability of making a type-2 error

1-β = Power = p(rejecting H0 | Ha is true)

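    •   Power can be increased by increasing the sample size, the difference (effect size) to be observed, and α
    •   The critical value is the α-percentile of the distribution under the null hypothesis; based on the effect size to be observed, decide on n (which sets the shape of the sampling distribution)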
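
A small sketch of how power moves with sample size and effect size. It assumes a one-sided z-test with known standard deviation, which is a simplification chosen for illustration; `z_test_power` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test (known sigma) to detect a mean shift of `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical value under H0
    shift = effect * n ** 0.5 / sigma         # mean of the test statistic under Ha
    return 1 - NormalDist().cdf(z_crit - shift)

for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With an effect of zero the function returns α itself, which matches the definition of the type-1 error rate.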


Confidence Interval (say 95%)  = if the sampling were repeated many times, 95% of the intervals constructed this way would contain the true parameter


Bootstrap confidence Interval

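  1. x1, x2, x3, ..., xn is a data sample drawn from distribution F; avg(F) below denotes the sample mean, used as the estimate of the mean of F
  2. Draw bootstrap samples by resampling n points with replacement; within each bootstrap sample compute δ = x_i - avg(F)
  3. For each bootstrap sample calculate avg(δ)
  4. Find the required quantiles of avg(δ) and make the range around avg(F)
  5. [avg(F) - avg(δ)_q1 ,  avg(F) - avg(δ)_q2], where q1 is the upper quantile (e.g. 97.5%) and q2 the lower quantile (e.g. 2.5%) for a 95% interval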
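
The steps above can be sketched with the standard library alone. This is one reading of the recipe (often called the empirical bootstrap), applied to the mean; the number of resamples, the seed, and the data are arbitrary assumptions.

```python
import random
from statistics import mean

def bootstrap_ci(sample, level=0.95, n_boot=2000, seed=0):
    """Bootstrap confidence interval for the mean, built from avg(delta) quantiles."""
    rng = random.Random(seed)
    n = len(sample)
    x_bar = mean(sample)
    # avg(delta) for each bootstrap sample = bootstrap mean minus the sample mean
    deltas = sorted(mean(rng.choices(sample, k=n)) - x_bar for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    # Subtract the upper quantile for the lower end and the lower quantile for the upper end
    return x_bar - deltas[hi_idx], x_bar - deltas[lo_idx]

data_rng = random.Random(1)
data = [data_rng.gauss(10, 2) for _ in range(50)]  # made-up sample
low, high = bootstrap_ci(data)
print(low, mean(data), high)
```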

Bayesian Hypothesis Testing

Accept H0 if P(H0 | Y=y)  > P(H1 | Y=y), otherwise accept H1





Tuesday, January 11, 2022

Zero Sum Game

Red Bus – Blue Bus problem 





The red bus - blue bus problem states that demand transfers between products in proportion to their attributes, since different attributes provide different utility to customers.

This demand transfer can also be seen as a zero-sum game, e.g. customer share shifting from one product to another.


Thus a choice model can be built by modelling the choice probability with a probit or logit model on the product attributes, to understand which attributes influence the customer and by how much.
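
As a rough illustration of the logit option, the sketch below turns utilities into choice shares with a plain multinomial logit. The utilities are made-up, equal values and `logit_choice_probs` is a hypothetical helper; in a real model the utilities would come from fitted attribute coefficients.

```python
import math

def logit_choice_probs(utilities):
    """Multinomial logit: share of each option is exp(utility) / sum of exp(utilities)."""
    top = max(utilities.values())  # subtract the max for numerical stability
    exp_u = {name: math.exp(u - top) for name, u in utilities.items()}
    total = sum(exp_u.values())
    return {name: e / total for name, e in exp_u.items()}

before = logit_choice_probs({"car": 1.0, "red_bus": 1.0})
after = logit_choice_probs({"car": 1.0, "red_bus": 1.0, "blue_bus": 1.0})
print(before)
print(after)
```

The shares always sum to 1, so whatever the blue bus gains is lost by the other options. Under this plain logit the new bus draws share equally from the car and the red bus.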


Utility Theory:





Thursday, January 6, 2022

Coxian Distribution

Coxian Distribution can be used to model:
Service time at a service center that offers a series of services, where the customer can say yes or no to continuing at each stage.

The basic building block of a Coxian distribution is the
Exponential Distribution

and its pdf is given as
f(x) = μ e^(-μx) ; x >= 0

It is generally used to model the time span between 2 events that arrive Poisson distributed, e.g. time between 2 calls in a call center, or the time it takes to cut hair in a barber shop.

Now, if 2 tasks where service-time/dwell-time are exponentially distributed are placed in
1. Series, they are called Hypo-exponential
e.g.

μ1(hair-cut)   ---->    μ2(shampoo) 
X1    ------>   X2

Now if we need the distribution of the sum of 2 random variables, we take the convolution of their pdfs

pdf(X1 + X2 = x) = ∫_0^x μ1 e^(-μ1 t) * μ2 e^(-μ2 (x-t)) dt
Hint: 1. t + (x-t) = x; in general we integrate over all possible values of "t" from -infinity to infinity
      2. both exponential pdfs are zero for negative arguments, so t >= 0 and (x-t) >= 0, and thus "t" is limited to values between 0 and x

Hypo-exponential p(x) = ......................
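
For μ1 ≠ μ2, carrying out the convolution integral above gives the standard closed form μ1 μ2 / (μ2 - μ1) * (e^(-μ1 x) - e^(-μ2 x)). The sketch below compares that expression with a numerical evaluation of the integral; the rates and the midpoint-rule step count are arbitrary assumptions.

```python
import math

def hypoexp_pdf(x, mu1, mu2):
    """Closed-form pdf of X1 + X2 for independent exponentials with rates mu1 != mu2."""
    return mu1 * mu2 / (mu2 - mu1) * (math.exp(-mu1 * x) - math.exp(-mu2 * x))

def convolution_pdf(x, mu1, mu2, steps=10000):
    """Midpoint-rule evaluation of the convolution integral over t in [0, x]."""
    h = x / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += mu1 * math.exp(-mu1 * t) * mu2 * math.exp(-mu2 * (x - t))
    return total * h

print(hypoexp_pdf(1.5, 2.0, 3.0), convolution_pdf(1.5, 2.0, 3.0))
```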

2. Parallel, they are called Hyper-exponential

{
Imagine there is a bag and there are 2 coins {A,B} inside it. An experimenter randomly picks a coin and tosses it to observe {Heads, Tails}.
So the probability of observing Heads can be written as
P(outcome = Head) = (probability of selecting coin A)(probability of getting head in coin A) + (probability of selecting coin B)(probability of getting head in coin B)
}
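
Carrying the coin analogy over to service times: pick one of two exponential branches at random, then draw the time from that branch. The sketch below assumes two branches with a made-up selection probability `p` and rates `mu1`, `mu2`, and checks the sample mean against the mixture mean p/μ1 + (1-p)/μ2.

```python
import random

def sample_hyperexp(rng, p, mu1, mu2):
    """Pick branch 1 with probability p (else branch 2), then draw an exponential time."""
    rate = mu1 if rng.random() < p else mu2
    return rng.expovariate(rate)

rng = random.Random(0)
p, mu1, mu2 = 0.3, 2.0, 0.5  # illustrative values only
draws = [sample_hyperexp(rng, p, mu1, mu2) for _ in range(50000)]

sample_mean = sum(draws) / len(draws)
expected_mean = p / mu1 + (1 - p) / mu2  # weighted average of the branch means
print(sample_mean, expected_mean)
```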

to be continued

Wednesday, December 29, 2021

Estimate parameters of Negative Binomial Distribution - NB?

 NB = distribution of the number of trials "n" (e.g. coin tosses) required for k successes (heads)

p = probability of success (fairness of the coin), which is to be estimated


{

pmf Derivation

The trial ends on observing the kth success. So the nth trial has the kth success.

So the first n-1 trials have k-1 successes, and those successes can be distributed in all possible ways among the n-1 positions, i.e.

(n-1)C(k-1) ways, each way/pattern having probability p^(k-1) (1-p)^(n-k)

And we know that the probability of success at the nth position is "p".

The first n-1 trials and the nth trial are independent, so we can multiply their probabilities to get the joint distribution

so

PMF = p * (n-1)C(k-1) p^(k-1) (1-p)^(n-k)

    = (n-1)C(k-1) p^k (1-p)^(n-k)

----------------------------------------------------------------------

Compounding probability distribution

Let's say p ~ Beta(α, β), i.e. the parameter p is itself a random variable

P(no of trials = n,  prob of success = p)

 = NBpmf(n, k, p) * Betapdf(p | α, β)

For each value of "p" between 0 and 1, we calculate the above probability and sum (integrate) it up, and thus get rid of the variable p. [Posterior predictive distribution]

Beta Compounded Negative Binomial PMF = ∫_0^1 NBpmf(n, k, p) * Betapdf(p | α, β) dp

BNBpmf = ∫_0^1 (n-1)C(k-1) p^k (1-p)^(n-k) * p^(α-1) (1-p)^(β-1) / B(α, β) dp

       = (n-1)C(k-1) / B(α, β) * ∫_0^1 p^(k+α-1) (1-p)^(n-k+β-1) dp

       = (n-1)C(k-1) B(k+α, β+n-k) / B(α, β)


}
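
A small check of the compounded pmf (n-1)C(k-1) B(k+α, β+n-k) / B(α, β) derived above, written with log-gamma functions to avoid overflow. The parameter values are arbitrary and `bnb_pmf` is a hypothetical helper name.

```python
import math

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bnb_pmf(n, k, alpha, beta):
    """P(n trials are needed for k successes) when p ~ Beta(alpha, beta)."""
    if n < k:
        return 0.0
    # log of (n-1)C(k-1)
    log_comb = math.lgamma(n) - math.lgamma(k) - math.lgamma(n - k + 1)
    return math.exp(log_comb + log_beta(k + alpha, beta + n - k) - log_beta(alpha, beta))

k, alpha, beta = 3, 5.0, 3.0
total = sum(bnb_pmf(n, k, alpha, beta) for n in range(k, 5000))
print(total)  # should be very close to 1
```

With k = 1 and a uniform prior (α = β = 1), the formula reduces to 1/(n(n+1)), which gives 1/2 for n = 1.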


Now assume the following sample is observed for "n", i.e. the number of trials needed to achieve k successes, and we want to estimate p, the probability of success

[n_1, n_2, n_3, n_4, ..., n_m]


 pmf_1 = (n_1 - 1)C(k-1) p^k (1-p)^(n_1 - k)


The joint distribution of all the observed samples, which are independent, is the product of the individual pmfs


L = Π_{i=1..m} pmf_i

Take the log to make taking the derivative simple,

LL = Σ_{i=1..m} log(pmf_i)

   = Σ_{i=1..m} log((n_i - 1)C(k-1) p^k (1-p)^(n_i - k))

To maximize the log likelihood, take the derivative with respect to p and equate it to zero

dLL/dp = mk/p - (Σ n_i - mk)/(1-p) = 0


p = mk/Σ n_i
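
The estimator p = mk/Σ n_i can be tried on simulated data. The sketch below uses an arbitrary true p, k, and sample size m, counts trials until the kth success, and then applies the formula.

```python
import random

def trials_until_k_successes(rng, k, p):
    """Number of Bernoulli(p) trials needed to observe k successes."""
    n = successes = 0
    while successes < k:
        n += 1
        successes += rng.random() < p  # True counts as 1
    return n

rng = random.Random(0)
k, true_p, m = 5, 0.3, 2000  # illustrative values only
ns = [trials_until_k_successes(rng, k, true_p) for _ in range(m)]

p_hat = m * k / sum(ns)  # maximum likelihood estimate
print(p_hat)
```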




Thursday, December 9, 2021

Estimate parameters of Binomial Distribution ?

Lets assume we observe data as following sample: 

[X1, X2, X3, X4 ......Xm] 

where each X_i is a vector of 0/1 outcomes (e.g. [0011111000...]) of fixed size "n", i.e. the number of trials is fixed at "n", and the numbers of successes observed are [k1, k2, k3, ..., km]

Now we know that pmf(n,p,k) = nCk p^k (1-p)^(n-k)

{

pmf derivation

k successes out of n trials can occur in nCk ways/patterns.

And each of those patterns happens with probability p^k (1-p)^(n-k)

so pmf = nCk p^k (1-p)^(n-k)

}


So, likelihood function of observing the data i.e. joint probability can be written as

L = (pmf-1)(pmf-2)(pmf-3).......(pmf-m)

Log L = log(pmf-1) + log(pmf-2) + log(pmf-3).......log(pmf-m)

= K + [k1 + k2 + k3 + ... km]log(p) + [mn -(k1 + k2 + k3 + ... km)] log(1-p)

(K collects the log(nCk_i) terms, which do not depend on p)

Gradient for maximizing likelihood

dLL/dp = [k1 + k2 + k3 + ... km]/p - [mn -(k1 + k2 + k3 + ... km)]/(1-p) = 0

p^ = [k1 + k2 + k3 + ... km]/(mn)
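
A quick simulated check of p^ = (k1 + ... + km)/(mn): draw m experiments of n coin tosses each with a made-up true p, count the successes, and apply the formula.

```python
import random

rng = random.Random(0)
n, true_p, m = 20, 0.35, 1000  # illustrative values only

# k_i = number of successes in the i-th experiment of n trials
ks = [sum(rng.random() < true_p for _ in range(n)) for _ in range(m)]

p_hat = sum(ks) / (m * n)  # maximum likelihood estimate
print(p_hat)
```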

Tuesday, December 7, 2021

Is the coin fair ? given that sample observed X = [..............]

1st Method CLT:

We calculate the sample mean of X = proportion of successes, which is approximately normal by the CLT

H0: proportion of heads for a fair coin = 0.5

Calculate the p-value with a one-sample proportion test.
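
A minimal sketch of the proportion test, assuming a two-sided alternative and the normal approximation with the standard error computed under H0. The counts in the example are made up, and `proportion_z_test` is a hypothetical helper name.

```python
import math

def proportion_z_test(k_success, n_trials, p0=0.5):
    """Two-sided one-sample z-test for a proportion (normal approximation)."""
    p_hat = k_success / n_trials
    se = math.sqrt(p0 * (1 - p0) / n_trials)      # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))
    return z, p_value

z, p_value = proportion_z_test(k_success=60, n_trials=100)
print(z, p_value)
```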


2nd Method Binomial pmf:

Pmf = Bin(k_success | n_trials, p=0.5 (fair coin null hypothesis))
CDF(k) = P(number of successes <= k) under H0
If CDF(k) < alpha (significance level), reject the null hypothesis (one-sided, lower-tail test)
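
The exact version of the rule above can be sketched by summing the binomial pmf directly. The example counts and the significance level are made up for illustration.

```python
import math

def binom_cdf(k, n, p=0.5):
    """P(number of successes <= k) for Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n_trials, k_success, alpha = 10, 1, 0.05
cdf = binom_cdf(k_success, n_trials)
print(cdf, "reject H0" if cdf < alpha else "fail to reject H0")
```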


Unboxing blackbox logistic regression(MLE)

Imagine we have a blackbox executable of logistic regression and the 2 hyperparameters tuned are regularisation and probability threshold. 
How can we extract the Beta coefficients of the model ?
