Notes for ICML Physics of LLM Talk

 Source: https://youtu.be/yBL7J0kgldU?si=koiBhKpq3Cp1M8G7

  • research methodology
    • deconstruct LLM abilities into building blocks: language structure, knowledge, reasoning, etc.
    • study each block in a controlled, idealized environment: control the data, tweak the parameters
    • keep experiments highly repeatable
      • ~100M-parameter models are enough to observe universal laws
      • each experiment runs on 1x H100 within a day
    • probe the models' inner workings
  • knowledge extraction
    • 2 types of data (a data-construction sketch follows this section)
      • biographies of N individuals
      • QA data asking for the facts stated in those N biographies
    • Training data: all N biographies + the QA data for N/2 of the individuals
    • Test data: the QA data for the other N/2 individuals
    • If the model answers the held-out N/2 individuals' biography questions well, it has knowledge extraction capability
    • Option 1: pretrain with the N biographies and the N/2 QA together
      • result: good knowledge extraction
    • Option 2: pretrain with the biography data only, then fine-tune with QA
      • result: bad knowledge extraction
    • Option 3: augment the biography data for each person (multiple rephrasings), pretrain with the biographies, fine-tune with QA
      • result: good knowledge extraction again
    • Analysis with probing
      • With only one biography per person, the person's facts are tied to the name only in the last layer; earlier layers carry no such knowledge
      • With augmented data, the many different phrasings cause the name-to-biography association to be imbued into all layers of the transformer
    • Option 4: pretrain with augmented biographies for N celebrities plus a single biography each for M minority individuals, fine-tune with QA on the N celebrities only
      • result: good knowledge extraction even for the minority individuals
      • analysis: the celebrity QA teaches the model how to extract knowledge from biographies, and this transfers to the minority individuals
    • this knowledge extraction capability from mixed training (BIO + QA) only exists in unidirectional models like GPT, not in bidirectional models like BERT
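
A minimal sketch of how such a bio + QA corpus can be constructed (the names, templates, and attribute fields below are invented for illustration; the actual biography datasets used in the talk are richer):

```python
import random

# Hypothetical attribute pools; the real biography data has more fields and values.
FIRST_NAMES = ["Anya", "Boris", "Carla", "Deven"]
LAST_NAMES = ["Forger", "Iverson", "Mendez", "Rao"]
CITIES = ["Princeton", "Austin", "Osaka", "Lyon"]
MAJORS = ["biology", "physics", "music", "law"]

# Augmentation = several phrasings of the same facts per person (Option 3).
BIO_TEMPLATES = [
    "{name} was born in {city}. {name} studied {major}.",
    "{name}, a native of {city}, majored in {major}.",
    "Having grown up in {city}, {name} pursued a degree in {major}.",
]

def make_person(i):
    return {
        "name": f"{random.choice(FIRST_NAMES)} {random.choice(LAST_NAMES)}-{i}",
        "city": random.choice(CITIES),
        "major": random.choice(MAJORS),
    }

def biographies(person, n_variants=1):
    # n_variants=1 mimics Option 2 (one bio per person); 3 mimics Option 3 (augmented).
    return [t.format(**person) for t in random.sample(BIO_TEMPLATES, n_variants)]

def qa_pair(person):
    return (f"Q: Which city was {person['name']} born in?", f"A: {person['city']}")

N = 10
people = [make_person(i) for i in range(N)]
pretrain_corpus = [b for p in people for b in biographies(p, n_variants=3)]
qa_train = [qa_pair(p) for p in people[: N // 2]]  # QA seen in training (co-train or fine-tune)
qa_test = [qa_pair(p) for p in people[N // 2:]]    # held-out people: measures extraction
```

Knowledge extraction is then measured as QA accuracy on `qa_test`, i.e. on people whose facts were only ever seen in biography form.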
  • knowledge manipulation
    • LLMs need CoT even for simple manipulations such as "was Joe Biden born in an even month?", because they have to write the retrieved fact down before operating on it (illustrated after this section)
    • LLMs cannot do inverse knowledge search, e.g. "who was born in 1995?"
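
A toy illustration of that point (the prompt strings are invented; the claim is only that the parity question requires writing the retrieved fact down first):

```python
# Direct query: "the parity of Biden's birth month" is never stored as a fact,
# so without intermediate steps the model tends to guess.
direct_prompt = "Was Joe Biden born in an even month? Answer yes or no."

# CoT query: retrieve the stored fact first, then manipulate it explicitly.
cot_prompt = (
    "Was Joe Biden born in an even month?\n"
    "Step 1: Joe Biden was born on November 20, 1942.\n"
    "Step 2: November is month 11, and 11 is odd.\n"
    "Answer: no"
)
```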
  • Knowledge capacity scaling law
    • If a piece of knowledge is exposed 1000 times during training (e.g. the same fact mentioned in 1000 different places), the model can achieve ~2 bits/param of capacity
      • "bit" means an information-entropy bit: a fact drawn uniformly from N possibilities carries log2(N) bits (worked example after this section)
    • If the knowledge is not exposed 1000 times, the GatedMLP (the gated feedforward layer used in LLaMA/Mistral) hurts knowledge capacity
    • The 2 bits/param still holds under int8 quantization of the parameters
    • Junk knowledge hurts capacity
      • If we prepend a domain tag to each piece of training data (e.g. a wikipedia.org prefix), capacity improves: marking which data is good helps the model
    • So, by this estimate, storing all the knowledge in the world would only need ~7B parameters
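
A back-of-the-envelope version of that bit accounting (the pool sizes and population below are made up; only the log2(N) counting and the ~2 bit/param figure come from the talk):

```python
import math

# Each attribute drawn uniformly from a pool of N possible values carries log2(N) bits.
attribute_pool_sizes = {
    "birth_date": 365 * 100,  # a day within a 100-year window
    "birth_city": 1000,
    "university": 300,
    "major": 100,
    "employer": 1000,
}

bits_per_person = sum(math.log2(n) for n in attribute_pool_sizes.values())

num_people = 100_000_000       # hypothetical number of notable individuals
total_bits = bits_per_person * num_people

bits_per_param = 2.0           # the ~2 bit/param capacity law (with enough exposure)
params_needed = total_bits / bits_per_param

print(f"{bits_per_person:.1f} bits per person")
print(f"total {total_bits / 1e9:.1f} Gbit -> ~{params_needed / 1e9:.2f}B params at 2 bit/param")
```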
  • Grade-school math and the hidden reasoning process
    • A new synthetic dataset, iGSM (grade-school-math style), containing math inference questions, e.g. "a supermarket has 5 bags, each bag holds 5 apples, so the supermarket has 5 x 5 apples in total"
    • Each question induces a parameter dependency graph, e.g. supermarket → bag → apple (toy example after this section)
    • One "op" is one reasoning step
    • LLMs perform well on tasks with a similar number of ops to the training data
    • LLMs also perform well on tasks with more ops than anything in the training data
    • So the LLM is learning a reasoning skill rather than memorizing reasoning templates
    • LLMs can do level-1 reasoning: compute only the parameters the question needs and find the shortest solution path, like humans
    • LLMs can also do level-2 reasoning: work out all parameter dependencies before starting to answer, which humans don't do
    • LLMs do make reasoning mistakes, and probing can catch them before the model starts to speak
    • More layers improve reasoning skill, especially deep reasoning with many steps
      • this refutes the claim that only model size matters
      • GPT-4o can't handle problems with op ≥ 11
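
A toy version of such a parameter-dependency graph and its op count, mirroring the supermarket example above (the real iGSM generator is far more elaborate):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Given quantities plus derived quantities; computing one derived quantity = one "op".
given = {"bags_per_supermarket": 5, "apples_per_bag": 5}

derived = {  # name -> (dependencies, how to combine them)
    "apples_per_supermarket": (
        ("bags_per_supermarket", "apples_per_bag"),
        lambda bags, apples: bags * apples,
    ),
}

def solve(query):
    """Walk the dependency graph in topological order, counting reasoning steps."""
    graph = {name: set(deps) for name, (deps, _) in derived.items()}
    values, ops = dict(given), 0
    for name in TopologicalSorter(graph).static_order():
        if name in derived:
            deps, combine = derived[name]
            values[name] = combine(*(values[d] for d in deps))
            ops += 1  # one reasoning step
        if name == query:
            break
    return values[query], ops

print(solve("apples_per_supermarket"))  # -> (25, 1)
```

In these terms, the level-1 vs. level-2 distinction above is whether the solver touches only the ancestors of the queried quantity or works out every dependency in the graph before answering.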
  • Learning from mistakes on grade-school math
    • LLMs make mistakes when a parameter is not yet available, e.g. trying to compute a parameter whose dependencies have not been computed
    • Probing the internal state shows the LLM knows when it has made a mistake and wants to take it back
    • We can insert "regret" data into the training data to teach it to back up (sketch after this section)
      • e.g. a wrong inference step followed by a [BACK] keyword
      • inference accuracy improves a lot, and the model does not make such mistakes at inference time even though there are mistakes in its training data
      • the mistake data must be in the pretraining data; adding it only during fine-tuning doesn't improve accuracy
    • Mistake-data generation
      • dumb approach: move a future inference step up and mark it as a mistake → large accuracy improvement
      • smart approach: extract random data from the original text → not much accuracy improvement
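
A minimal sketch of the "dumb" mistake-injection scheme above (the [BACK] token is from the talk; the step strings and the mistake probability are invented):

```python
import random

correct_steps = [
    "bags_per_supermarket = 5",
    "apples_per_bag = 5",
    "apples_per_supermarket = bags_per_supermarket * apples_per_bag",
]

def with_regret_data(steps, p_mistake=0.3):
    """Occasionally emit a future step too early, then retract it with [BACK]."""
    out = []
    for i, step in enumerate(steps):
        future = steps[i + 1:]
        if future and random.random() < p_mistake:
            out.append(random.choice(future))  # a step whose dependencies aren't ready yet
            out.append("[BACK]")               # the retraction the model learns to emit
        out.append(step)
    return "\n".join(out)

print(with_regret_data(correct_steps))
```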
  • learning hierarchical language structure
    • Hallucination reflects the fact that the LLM learns the language format much faster than the knowledge itself
    • Construct synthetic context-free grammars (CFGs) that are much more complicated than English, whose sentences look like long combinations of a few random symbols
    • GPT with relative or rotary position embeddings reaches about 90% accuracy, similar to a "uniform" attention variant where a static configuration makes each attention head always look back a fixed n tokens
    • GPT with absolute position embeddings performs much worse, around 50% accuracy
    • relative position performs slightly better than rotary position but has higher latency, so rotary position is more commonly used
    • GPT learns the CFG parse tree: for each token, its embedding encodes the token's parent, grandparent, etc.
    • Encoder-only models like BERT don't develop this capability: their objective is to predict masked tokens, which favors learning local relationships, whereas decoder models like GPT must generate text, which requires integrating all the information from the preceding text
    • The attention mechanism is effectively doing dynamic programming to learn the CFG (a tiny CFG and its CYK DP are sketched after this section)
      • the DP states are encoded in the embeddings
      • attention performs the DP transition steps
      • during text generation it performs an even more complicated DP transition
    • Injecting grammar mistakes into the training data also helps
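
To make the DP analogy concrete, here is a tiny synthetic CFG plus the textbook CYK dynamic program for it (the grammar is invented and far smaller than those in the talk; CYK is the standard algorithm, not a claim about what the transformer literally computes):

```python
import random

# A tiny CFG in Chomsky normal form over the terminals "1", "2", "3".
RULES = {
    "S": [("A", "B"), ("B", "A")],
    "A": [("A", "B"), ("1",)],
    "B": [("2",), ("3",)],
}

def generate(symbol="S", depth=0):
    """Sample a string from the grammar, bounding recursion depth so it terminates."""
    if symbol not in RULES:
        return [symbol]  # terminal
    options = RULES[symbol] if depth < 6 else RULES[symbol][-1:]
    return [tok for s in random.choice(options) for tok in generate(s, depth + 1)]

def cyk(tokens):
    """CYK membership DP: chart[i][j] = nonterminals that derive tokens[i..j]."""
    n = len(tokens)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i] = {nt for nt, rhss in RULES.items() if (tok,) in rhss}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # the DP transition over split points
                for nt, rhss in RULES.items():
                    for rhs in rhss:
                        if len(rhs) == 2 and rhs[0] in chart[i][k] and rhs[1] in chart[k + 1][j]:
                            chart[i][j].add(nt)
    return "S" in chart[0][n - 1]

sample = generate()
print("".join(sample), cyk(sample))  # generated samples should parse back to S
```

In the talk's analogy, the role of the chart entries is played by the token embeddings, and the transition over split points is carried out by attention.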
  • All of the above studies are based on synthetic data. OpenAI claims we are running out of real data, so we do need synthetic data.
