Current trend: Scale everything—parameters, data, compute.
Where do you want to dig deeper? Neural net architectures? Symbolic logic? The war between statistical and symbolic camps?
User question
rules-based modelling and how it led to NLP and language corpora like Stanford's
ChatGPT
Rules-based modeling in AI was the dominant approach before statistical methods matured. Here’s how it directly led to modern NLP and the creation of large corpora like Stanford’s.
1. Rules-Based NLP: Origins
Early NLP (1960s–1980s) relied on handwritten grammar rules, dictionaries, and logic trees.
Examples:
ELIZA (1966): Used pattern-matching templates to simulate a psychotherapist.
SHRDLU (1970): Parsed natural language into symbolic representations to manipulate blocks in a virtual world.
Core techniques (a minimal grammar sketch follows this list):
Context-Free Grammars (CFGs)
Phrase structure rules
Finite State Automata for morphology
Lexicon-based parsing
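To make these concrete, here is a minimal sketch of a hand-written grammar and lexicon using NLTK (library and toy grammar chosen here purely for illustration, not mentioned above):

```python
import nltk

# A toy hand-written grammar in the spirit of early rule-based NLP:
# phrase structure rules plus a tiny lexicon.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)

# A sentence the rules cover parses into a tree...
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N ball))))

# ...but any wording outside the hand-written rules simply fails to parse,
# which is exactly the brittleness described below.
```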
Limitation: Language is messy. Rule-based systems were brittle, hard to scale, and failed on ambiguity or novel inputs.
2. Push Toward Corpora and Data
Researchers realized you can’t model all of language from rules. Enter empirical linguistics.
Statistical NLP (1990s) emerged from this frustration.
Big shift: Instead of rules, use probability + data:
Train a model on real examples.
Estimate likelihoods: e.g., “the dog” is more probable than “dog the”.
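A toy sketch of that shift (the corpus and counts here are invented purely for illustration): a maximum-likelihood bigram model just counts word pairs.

```python
from collections import Counter

# Tiny invented corpus, already tokenized.
corpus = "the dog barks . the dog runs . the cat runs .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "dog"))  # ≈ 0.67 -> "the dog" is likely
print(bigram_prob("dog", "the"))  # 0.0    -> "dog the" never occurs in the data
```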
3. Corpus Creation: Fuel for Data-Driven NLP
To train models, you need lots of labeled data. Hence:
Stanford and Others Step In:
Penn Treebank (1993, University of Pennsylvania): Sentences annotated with POS tags and syntactic parse trees.
WordNet (Princeton): A lexical database encoding relationships (synonymy, hypernymy) between words.
Stanford NLP Group:
Created corpora like Stanford Sentiment Treebank.
Built open-source tools (POS taggers, NER, parsers) based on probabilistic models.
Later adopted neural models (e.g., GloVe, tree-structured RNNs).
These corpora:
Anchored the shift to machine learning-based NLP.
Set benchmarks.
Were used to evaluate generalization, especially before massive pretraining became dominant.
Stanford and peers standardized the transition by providing clean, labeled, shareable data that let statistical methods flourish.
Want to explore early statistical models like HMMs and CRFs? Or go straight into how GloVe and transformers superseded them?
User question
part of speech tagging and ner are relevant to transformers? how?
ChatGPT
Yes—Part-of-Speech (POS) tagging and Named Entity Recognition (NER) are both core sequence labeling tasks, and transformers excel at these. Here’s how they connect:
1. POS Tagging & NER: Classic Tasks
POS tagging: Label each word with its grammatical role (e.g., noun, verb).
NER: Label each word/token as part of an entity (e.g., PERSON, ORG, LOC).
They require contextual understanding—a word’s meaning depends on neighbors. That’s where transformers shine.
2. Transformers: Perfect Fit for Sequence Labeling
Self-attention: Every token sees every other token. Context modeling is deep and bidirectional.
Ex: In “He banked the plane” vs. “He went to the bank,” “bank” gets a different representation in each sentence.
Output structure: Transformers output a vector per token—ideal for tagging each token individually.
3. Pretraining + Fine-tuning
Models like BERT are pretrained on huge corpora with language modeling objectives (e.g., masked word prediction).
For tagging tasks:
Add a classification head on top of each token’s embedding.
Fine-tune on a labeled dataset (e.g., CoNLL-2003 for NER); a minimal pipeline sketch follows the examples below.
4. Real Examples
BERT for NER: Fine-tuned on token classification with BIO tags (e.g., B-PER, I-PER, O).
POS tagging: BERT achieves near-human accuracy when fine-tuned on corpora like Penn Treebank.
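Here is a minimal sketch of such a fine-tuned tagger in use, via the Hugging Face transformers library (the checkpoint name is an assumption; any BERT model fine-tuned on CoNLL-2003-style BIO tags behaves similarly):

```python
from transformers import pipeline

# Token classification = one label per token; here, a BERT checkpoint
# fine-tuned for NER (assumed to be available on the Hugging Face Hub).
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge B-/I- word pieces into whole entities
)

for ent in ner("Angela Merkel visited the Stanford NLP Group in California."):
    # Each entity carries a type (PER/ORG/LOC/MISC), a confidence score, and a text span.
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))
```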
5. Why Transformers Replaced Older Methods
Before transformers: Used HMMs, CRFs, RNNs, or tree-based models.
These struggled with:
Long-range dependencies
Polysemy
Scalability
Transformers solved these cleanly with parallelism + attention + pretrained knowledge.
So yes—POS tagging and NER didn’t just survive the transformer era—they were early proofs that transformers could handle real NLP tasks better than previous approaches.
Want to see a minimal BERT-based NER pipeline or compare it with CRF-based tagging?
User question
polysemy?
ChatGPT
Polysemy is when a single word has multiple related meanings.
Example:
“He sat by the bank of the river.”
“She deposited money at the bank.”
Same word, different senses. (Strictly, many linguists file the river/money split under homonymy and reserve polysemy for related senses, e.g. the bank as an institution vs. the building, but the NLP challenge is identical: one surface form, several meanings.)
Why it matters in NLP:
Tagging tasks (like NER, POS) need context to disambiguate.
Is “bank” a noun (place) or a verb (action)?
Is it part of an ORG (in NER) or a location?
How Transformers Help:
Older models (e.g., word2vec) gave a single vector per word, so “bank” always looked the same.
Transformers (e.g., BERT):
Generate contextual embeddings.
“Bank” in a financial sentence looks different from “bank” in a nature sentence—because the model sees surrounding words.
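A quick sketch of that difference, using the Hugging Face transformers library and the bert-base-uncased checkpoint (setup assumed here; sentences adapted from the examples above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return BERT's contextual hidden state for the token 'bank' in this sentence."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_money = bank_vector("she deposited money at the bank")
v_river = bank_vector("he sat by the bank of the river")

# Same surface form, different contexts, different vectors:
# the cosine similarity is typically well below 1.0.
cos = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(f"cosine similarity: {cos.item():.2f}")
```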
Bottom line:
Polysemy is why static embeddings failed, and why transformers + self-attention + context-aware vectors dominate now.
Want examples of how BERT handles polysemy in practice?
User question
explain the self-attention paper
ChatGPT
You’re asking for the “Attention Is All You Need” paper (Vaswani et al., 2017)—the transformer blueprint. Here’s a sharp breakdown:
1. Why This Paper?
Prior models (RNNs, LSTMs) processed sequences one token at a time, which limits parallelism and makes long-range dependencies hard to capture.
This paper nukes recurrence. It introduces transformers—based entirely on self-attention.
2. Key Idea: Self-Attention
Each token builds its representation as a weighted average over all tokens in the sequence, including itself.
Mechanics:
Each token is turned into 3 vectors:
Query (Q)
Key (K)
Value (V)
The raw attention score between token i and token j is:
score(i, j) = dot(Qᵢ, Kⱼ)
Scale by √dₖ, apply softmax over each row, and use the resulting weights to mix the value vectors:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
So every token aggregates information from the full sequence—context matters.
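A minimal NumPy sketch of that formula (all dimensions and weights invented for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over token vectors X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # project tokens into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ V                      # every token mixes all value vectors

# Toy setup: 4 tokens, model width 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one context-aware vector per token
```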
3. Multi-Head Attention
Instead of one Q/K/V set, use multiple parallel ones (“heads”) to learn different types of relationships.
Output is concatenated + projected back.
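Continuing the sketch above (same invented toy dimensions), multi-head attention runs the same computation on parallel slices of the features, then concatenates and projects:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    d_head = X.shape[1] // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)      # this head's slice of the features
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])      # each head attends independently
    return np.concatenate(heads, axis=-1) @ Wo       # concatenate + project back

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2).shape)  # (4, 8)
```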
4. Positional Encoding
No recurrence = no order info.
Add sinusoidal vectors to input embeddings to encode position (fixed function; no learning).
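A small sketch of those fixed sinusoidal vectors (sequence length and width chosen arbitrarily for illustration):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions get cosine
    return pe

# These vectors are simply added to the token embeddings before the first layer.
print(sinusoidal_positions(seq_len=10, d_model=16).shape)  # (10, 16)
```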
5. Encoder-Decoder Architecture
Used for sequence-to-sequence tasks (e.g. translation).
Encoder: Stack of identical layers:
Multi-head self-attention
Feed-forward
Layer norm + residuals
Decoder: Similar, but:
Includes masked self-attention (so tokens don’t see future)
Includes cross-attention to encoder outputs
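A sketch of the masking trick in the decoder's self-attention (random numbers stand in for the real QKᵀ/√dₖ scores):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))        # stand-in for QKᵀ/√dₖ

# Set scores for future positions to -inf so that, after softmax,
# each token attends only to itself and earlier tokens.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = softmax(scores, axis=-1)
print(np.round(weights, 2))  # upper triangle is all zeros: no peeking at the future
```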
6. Training Details
Evaluated on WMT 2014 machine translation (English-German and English-French).
Beat the state of the art at the time while training faster, thanks to parallelism.
7. Impact
Kickstarted modern NLP.
Basis for BERT, GPT, T5, etc.
Self-attention generalized across domains: vision, audio, protein folding, etc.
Want me to walk through a specific part—like masked attention, residuals, or positional encodings—in code or diagrams?
User question
softmax?
ChatGPT
Softmax is a mathematical function that turns a list of numbers into probabilities—values between 0 and 1 that sum to 1.
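A minimal implementation (example values invented):

```python
import numpy as np

def softmax(x):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.max()      # subtracting the max doesn't change the result, but avoids overflow
    e = np.exp(x)        # exponentiate: larger scores get disproportionately larger weight
    return e / e.sum()   # normalize so the outputs sum to 1

print(softmax([2.0, 1.0, 0.1]))  # ≈ [0.66, 0.24, 0.10]
```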