Corpora and AI / LLMs
Introduction to Mark Davies and His Work
Overview of Expertise
- Mark Davies introduces himself as a professor emeritus of corpus linguistics, highlighting his extensive publication record on creating and utilizing corpora for linguistic research.
- He is the creator of the corpora at english-corpora.org, which are widely used by researchers, teachers, and students globally.
Comparing LLMs with Corpus Data
Focus of the Video
- The video compares data from two large language models (LLMs): ChatGPT-4 from OpenAI and Gemini 1.5 Pro from Google.
- It notes that similar results can be observed in other LLMs, such as those developed by Anthropic, Meta, or DeepSeek.
Purpose and Structure of the Analysis
Key Questions Addressed
- The video seeks to answer how well LLM predictions align with actual corpus data and what factors influence their accuracy or inaccuracy.
- It also explores the role of traditional corpora in an era dominated by LLM technology.
Outline of Content
- The discussion will cover collocates, word frequency predictions from LLMs compared to corpus data, genre variation over time, dialect differences, and concluding remarks on the usefulness of both LLMs and full-featured corpora.
Strength of LLMs: Collocates
Insights into Collocates
- LLMs excel at identifying collocates (words that frequently appear near each other), which in turn give insight into a word's meaning and usage.
- Examples include analyzing collocates for words like "colander" and "sculpin," showcasing how these models outperform traditional corpora in providing contextual understanding.
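As a rough illustration of what "collocates" means computationally, the sketch below counts the words that occur within a fixed window around a node word. The sample sentences, the stopword list, the window size of 4, and the function name `collocates` are all assumptions made for illustration. Real corpus tools work over far more text and usually rank candidates with an association measure rather than raw counts.

```python
import re
from collections import Counter

def collocates(sentences, node, window=4, stopwords=frozenset()):
    """Count words within `window` tokens of `node`, sentence by sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the node itself and any function words.
                if j != i and tokens[j] not in stopwords:
                    counts[tokens[j]] += 1
    return counts

# Hypothetical sample sentences, not taken from any real corpus.
SENTENCES = [
    "Drain the pasta in a colander over the sink.",
    "Rinse the beans in a colander and drain well.",
    "She set the colander in the sink to drain the noodles.",
]
STOPWORDS = frozenset({"the", "a", "in", "and", "to", "over", "she"})

for word, n in collocates(SENTENCES, "colander", stopwords=STOPWORDS).most_common(5):
    print(word, n)
```

A count like this only lists the neighbouring words; interpreting what they say about the node word is still left to the user, which is the gap the LLM explanations fill.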
Comparison Between Corpora and LLM Outputs
Effectiveness in Meaning Interpretation
- Sketch Engine offers useful displays for contrasting collocates, but users must interpret the meanings themselves, whereas LLM outputs provide explanations automatically.
- This automatic interpretation is particularly beneficial for non-native speakers and language learners who may struggle with nuanced meanings.
Word Frequency Predictions
Accuracy Assessment
- Overall predictions regarding word frequency from LLM outputs align well with actual corpus data when given a range of words based on frequency lists.
Analysis of LLMs and Corpus Data
Limitations of LLMs in Word Frequency Generation
- LLMs struggle to generate lists of high-frequency words accurately, as evidenced by the words highlighted in yellow in the video: these appear frequently in both the COCA and iWeb corpora but are absent from the LLM-generated lists.
- The generation of low-frequency words (around rank 40,000 or 50,000 in a frequency list) is particularly poor in LLM outputs; authentic corpus-based data is recommended for reliable frequency information.
Phrase Frequency Challenges
- The phrases generated by LLMs often do not match the high-frequency phrases found in corpora. The entries highlighted in orange in the video are phrases that the LLMs rank as top phrases but that lack corpus support.
- Interestingly, when provided with high-frequency phrases from corpora, LLMs can rank them accurately based on frequency, showcasing their recognition skills despite initial generation shortcomings.
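One way to quantify how closely an LLM's ranking of a given phrase list matches the corpus ranking is a rank correlation. The sketch below computes Spearman's rho for two orderings of the same items, assuming no ties. The phrase lists are hypothetical placeholders rather than real corpus or LLM output, and the video does not say which measure, if any, was used.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation for two orderings of the same items (no ties)."""
    if set(order_a) != set(order_b) or len(order_a) != len(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items to correlate")
    pos_a = {item: i for i, item in enumerate(order_a)}
    pos_b = {item: i for i, item in enumerate(order_b)}
    d_squared = sum((pos_a[item] - pos_b[item]) ** 2 for item in order_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical orderings, most frequent first (placeholders, not real data).
corpus_order = ["of the", "in the", "to the", "on the", "for the"]
llm_order = ["of the", "in the", "on the", "to the", "for the"]

print(round(spearman_rho(corpus_order, llm_order), 3))
```

A value near 1 means the two rankings largely agree, a value near 0 means no relationship, and a value near -1 means the order is reversed.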
Genre Recognition Capabilities
- LLMs are proficient at assigning words and phrases to genres such as spoken, fiction, and academic language, but they have trouble distinguishing between certain genres, such as newspapers versus academic writing.
- They excel at identifying sub-genres within academia (e.g., legal or medical), yet struggle with differentiating among blog types (opinion vs instructional). Their strength lies more in genre distinction than generating genre-specific lists.
Syntactic Construction Analysis
- LLM predictions about the frequency of syntactic or grammatical constructions across genres align well with corpus data. LLMs also explain these differences effectively, unlike traditional corpora, which only present raw data.
- However, hallucinations may occur when comparing genres, indicating potential inaccuracies that need further exploration later in the discussion.
Historical Language Change Predictions
- Predictions about word frequency changes over time are generally accurate; this includes informal language trends observed on TV shows and movies as well as syntactic construction shifts (e.g., "we haven't the time" evolving to "we don't have the time").
- While GPT provides helpful insights into these changes, Gemini appears hesitant to make predictions about the historical usage of specific constructions. For a comprehensive analysis, refer to the white paper on historical change at english-corpora.org.
Challenges with Less Obvious Words
- The effectiveness of LLM predictions diminishes when dealing with less obvious terms compared to clear-cut examples like "battleship" or "selfie." This suggests reliance on contextual clues from texts rather than an inherent understanding of language evolution over decades.
Analysis of LLM Predictions and Corpus Data
Dialectal Variation Predictions
- The predictions from large language models (LLMs) align well with corpus data regarding informal word usage across dialects, such as US vs. UK English.
- LLMs demonstrate a good ability to predict the frequency of words in different countries, although they do make occasional mistakes.
- GPT outperforms Gemini in accuracy when predicting syntactic constructions' frequency across dialects.
Collocates and Cultural Context
- LLMs excel at identifying collocates, revealing differences in meaning and usage between dialects, exemplified by the word "cupboard" in US and UK contexts.
- The adjectives associated with "wife" differ significantly between Asian/African dialects and inner circle dialects like US or UK, reflecting cultural practices.
- While LLM predictions are accurate for obvious terms (e.g., "subarctic" in Canada), they struggle with less common words specific to certain regions (e.g., "shaggy" in Jamaica).
Limitations of LLM Analysis
- There are instances where LLM outputs appear to be regurgitated information rather than original analysis, particularly noted with the term "cupboard."
- Gemini inaccurately claims that the construction "try and + verb" is widely accepted in American English, despite corpus evidence suggesting otherwise.
Strengths of Large Language Models
- LLM capabilities include generating collocates effectively while categorizing information about words across genres, time periods, and dialect variations.
- They possess a solid understanding of syntactic differences but often replicate information from other sources instead of providing novel insights.
Challenges with Word Frequency Data
- LLM predictions fall short when it comes to generating accurate word frequency data; using a dedicated corpus is recommended for precise statistics.
- They also struggle with nuanced variations between genres or time periods compared to broader syntactic variations observed across different contexts.
Advantages of Using Corpora
- Corpora provide verifiable data that can be checked against original sources, unlike LLM-generated content which lacks transparency on data origins.
Advantages of Corpora Over LLMs
Variability in Data Reliability
- LLM output collected on two different days varied considerably: only 12 of the 20 words starting with "spri" appeared in both lists. This raises questions about the reliability and replicability of such data.
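The "12 out of 20" figure can be expressed as a simple overlap measure. The sketch below compares two word lists and reports the shared count, the share of each list that is shared, and the Jaccard similarity. The lists are built from placeholder tokens (`w0`, `w1`, ...) chosen only so that 12 of 20 items overlap; they are not the actual words from the video.

```python
def overlap_stats(run_a, run_b):
    """Return (shared count, shared share of run_a, Jaccard similarity)."""
    set_a, set_b = set(run_a), set(run_b)
    shared = set_a & set_b
    union = set_a | set_b
    share = len(shared) / len(set_a) if set_a else 0.0
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), share, jaccard

# Placeholder tokens standing in for two 20-word lists with 12 words in common.
day_one = [f"w{i}" for i in range(0, 20)]
day_two = [f"w{i}" for i in range(8, 28)]

shared, share, jaccard = overlap_stats(day_one, day_two)
print(shared, round(share, 2), round(jaccard, 3))
```

On these numbers only 60% of each list recurs, and the two runs share fewer than half of all the distinct words they produced between them. A corpus query, by contrast, returns the same list every time it is run.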
Hallucination and Accuracy Issues
- LLMs such as Gemini often produce inaccurate syntactic information, wrongly claiming that certain constructions are more common in one region than another (e.g., "stopped him doing that" vs. "stopped him from doing that").
- LLMs struggle with narrow lexically driven constructions, frequently hallucinating or providing incorrect information for specific phrases compared to broader linguistic phenomena.
Keyword Contextualization
- Corpora allow users to see keywords in context, revealing patterns and associations that are difficult to discern using LLMs. For instance, negative words often precede the word "fathom."
- Unlike LLM-generated phrases which may be fabricated, corpora provide authentic examples from real texts.
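A keyword-in-context (KWIC) display can be sketched in a few lines: for each hit, show a few tokens to the left and right of the keyword so that recurring neighbours line up visually. The example sentences below are invented for illustration, which is exactly what a real corpus avoids, and the function name `kwic` and the context width of 3 are assumptions.

```python
import re

def kwic(sentences, keyword, width=3):
    """Return (left, keyword, right) context tuples for each hit of `keyword`."""
    rows = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    return rows

# Invented example sentences (a real corpus would supply authentic ones).
EXAMPLES = [
    "I cannot fathom why he left.",
    "She couldn't fathom the reason for it.",
    "It is hard to fathom how that happened.",
]

for left, node, right in kwic(EXAMPLES, "fathom"):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>20}  [{node}]  {right}")
```

With the keyword aligned in a column, patterns in the left context (here, negative or difficulty-related words before "fathom") become easy to spot by eye.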
Immersive Learning Experience
- Corpora offer rich interfaces with interconnected links between words and phrases, enhancing the learning experience for language learners compared to simple text responses from LLMs.
- Users can explore related terms extensively; for example, searching for "path" leads to a wealth of information on associated words like "trail," including definitions and collocates.
User Applications and Benefits
- Researchers seeking reliable data should prefer corpus data over LLM outputs due to accuracy concerns regarding frequency generation.