Natural Language Processing

JamC-QA

A Japanese-language benchmark of 2,341 multiple-choice questions designed to evaluate how well AI models understand Japanese culture, history, geography, and other knowledge areas.

Tags: JamC-QA, Japanese LLMs, benchmark dataset, multiple-choice QA, LLM evaluation
Created: December 18, 2025

What is JamC-QA?

JamC-QA (Japanese Multiple Choice Question Answering) is a large-scale benchmark dataset specifically designed to evaluate large language models on Japanese-specific knowledge and cultural understanding. The dataset tests models across eight carefully selected domains: Japanese culture, customs, regional identity, geography, history, government, law, and healthcare.

The benchmark fills a critical gap in LLM evaluation by focusing on knowledge areas underrepresented or entirely absent in major international benchmarks like MMLU, HellaSwag, or SQuAD. JamC-QA enables fair, domain-specific comparison of Japanese and multilingual LLMs, supports leaderboard evaluations and ablation studies, and provides essential validation for AI systems serving Japanese-speaking users.

JamC-QA is widely adopted for benchmarking Japanese language models, appears on major leaderboards including the Swallow LLM Leaderboard, and is referenced in academic literature as a gold standard for Japan-centric factual and general knowledge proficiency.

Dataset Composition

Knowledge Categories

JamC-QA comprises 2,341 multiple-choice questions across eight knowledge categories selected for their relevance to Japanese society and absence from other popular QA benchmarks:

Category          | Dev | Test  | Focus Area
------------------|-----|-------|---------------------------------------------------
Culture           | 4   | 640   | Arts, cinema, literature, music, cultural literacy
Custom            | 4   | 200   | Social customs, etiquette, festivals, traditions
Regional Identity | 4   | 397   | Regional knowledge, dialects, local phenomena
Geography         | 4   | 272   | Physical geography, place names, natural features
History           | 4   | 343   | Historical events, figures, periods, cultural shifts
Government        | 4   | 110   | Political systems, policies, governmental roles
Law               | 4   | 299   | Legislation, legal systems, rights, regulations
Healthcare        | 4   | 48    | Medical systems, terminology, public health
Total             | 32  | 2,309 |

Category Selection Rationale:

  • Core to Japanese daily life and culture
  • Underrepresented in global benchmarks
  • Requires specific cultural and linguistic knowledge
  • Spans factual recall to contextual understanding
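
The split sizes in the table above can be reproduced directly from the published dataset. A minimal sketch using the Hugging Face datasets library, with the repository and configuration names taken from the loading example later in this article:

from collections import Counter
from datasets import load_dataset

# Count questions per category in each split; totals should match the table above.
jamcqa = load_dataset("sbintuitions/JamC-QA", "v1.0")
for split in ("dev", "test"):
    counts = Counter(jamcqa[split]["category"])
    print(split, sum(counts.values()), dict(counts))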

Data Splits

Development Split (32 questions)

  • Four questions per category
  • Used for few-shot evaluation
  • Enables model calibration with minimal exposure
  • Supports prompt engineering and fine-tuning

Test Split (2,309 questions)

  • Main evaluation testbed
  • Statistically robust sample per category
  • Used for leaderboard rankings
  • Enables detailed performance analysis

Dataset Structure

Data Format

Each instance is a single multiple-choice question with four answer options and one correct answer. The dataset is formatted for Hugging Face datasets library integration and supports evaluation frameworks like FlexEval.

Example Instance:

{
  "qid": "jamcqa-test-culture-00001",
  "category": "culture",
  "question": "「狂った世で気が狂うなら気は確かだ」の名言を残した映画はどれ?",
  "choice0": "影武者",
  "choice1": "羅生門",
  "choice2": "隠し砦の三悪人",
  "choice3": "乱",
  "answer_index": 3
}

(The question asks which film contains the line "In a mad world, only the mad are sane"; the correct choice is 乱 (Ran), answer_index 3.)

Field Definitions

Field        | Type    | Description
-------------|---------|------------------------------------------------------
qid          | string  | Unique question identifier
category     | string  | Knowledge category label
question     | string  | Question text in Japanese (half-width except katakana)
choice0-3    | string  | Four answer options (half-width except katakana)
answer_index | integer | Index of the correct answer (0-3)

Data Constraints:

  • No line breaks in any field
  • Leading and trailing whitespace removed
  • Half-width characters except for katakana
  • Each question curated for cultural accuracy
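
These constraints can be spot-checked programmatically. A minimal sketch over the test split, assuming only the field names shown in the example instance (an illustrative check, not an official validation script):

from datasets import load_dataset

jamcqa_test = load_dataset("sbintuitions/JamC-QA", "v1.0", split="test")

for row in jamcqa_test:
    # Question text plus the four answer options.
    fields = [row["question"]] + [row[f"choice{i}"] for i in range(4)]
    for value in fields:
        assert "\n" not in value, f"line break in {row['qid']}"
        assert value == value.strip(), f"stray whitespace in {row['qid']}"
    # The correct answer index must point at one of the four choices.
    assert row["answer_index"] in range(4), f"bad answer_index in {row['qid']}"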

Evaluation Methodology

Primary Metric

Exact Match Accuracy
Models must output the exact answer string, not just the option label or index. This strict criterion credits only precise answer generation, not near misses.

Calculation:

Accuracy = (Number of exact matches) / (Total number of questions)

Category-Level Analysis
Accuracy reported per category enables fine-grained analysis of model strengths and weaknesses across knowledge domains.
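
A minimal sketch of this scoring scheme, assuming each model prediction has already been reduced to a single answer string (field names follow the example instance; this is not the official scorer):

from collections import defaultdict

def exact_match_scores(rows, predictions):
    """Return overall accuracy and per-category accuracy."""
    hits, per_category = 0, defaultdict(lambda: [0, 0])
    for row, prediction in zip(rows, predictions):
        gold = row[f"choice{row['answer_index']}"]   # correct answer string
        hit = int(prediction.strip() == gold)        # strict exact match
        hits += hit
        per_category[row["category"]][0] += hit
        per_category[row["category"]][1] += 1
    overall = hits / len(rows)
    by_category = {c: h / n for c, (h, n) in per_category.items()}
    return overall, by_category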

Why Exact Match?

  • Ensures precise answer generation capability
  • Critical for factual and culturally nuanced questions
  • Prevents partial credit for close but incorrect answers
  • Validates true understanding versus pattern matching

Performance Analysis

Category-level results reveal:

  • Domain-specific model strengths
  • Knowledge gaps requiring attention
  • Transfer learning effectiveness
  • Cultural adaptation success

Leaderboard Results

Representative performance from a major Japanese LLM leaderboard (accuracy scores):

Model                      | All   | Culture | Custom | Regional | Geography | History | Government | Law   | Healthcare
---------------------------|-------|---------|--------|----------|-----------|---------|------------|-------|-----------
sarashina2-8x70b           | 0.725 | 0.714   | 0.775  | 0.761    | 0.654     | 0.784   | 0.736      | 0.632 | 0.917
sarashina2-70b             | 0.725 | 0.719   | 0.745  | 0.736    | 0.673     | 0.764   | 0.764      | 0.666 | 0.917
Llama-3.3-Swallow-70B-v0.4 | 0.697 | 0.689   | 0.775  | 0.589    | 0.566     | 0.776   | 0.773      | 0.783 | 0.854
RakutenAI-2.0-8x7B         | 0.633 | 0.622   | 0.725  | 0.617    | 0.511     | 0.714   | 0.709      | 0.575 | 0.813
plamo-100b                 | 0.603 | 0.602   | 0.650  | 0.637    | 0.504     | 0.682   | 0.609      | 0.515 | 0.688

Key Observations:

  • Best overall performance: sarashina2 models (0.725)
  • Strongest category: Healthcare (up to 0.917)
  • Greatest variation: Regional identity and geography
  • Model diversity: Japanese-specialized and multilingual LLMs

Usage and Implementation

Loading with Hugging Face

import datasets

# Load dataset
jamcqa = datasets.load_dataset('sbintuitions/JamC-QA', 'v1.0')

# Access splits
jamcqa_test = jamcqa['test']
jamcqa_dev = jamcqa['dev']

# Inspect question
print(jamcqa_test[0])

Dataset Viewer:
Browse and filter interactively on Hugging Face Data Studio.
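
The dev split is small enough to serve directly as few-shot exemplars. A minimal sketch that reuses jamcqa_dev and jamcqa_test from the loading example above; the Japanese prompt template is an illustrative assumption, not the template used by any official evaluation:

def format_question(row):
    # One question followed by its four numbered options.
    options = "\n".join(f"{i}. {row[f'choice{i}']}" for i in range(4))
    return f"質問: {row['question']}\n{options}\n答え:"

def build_fewshot_prompt(dev_rows, test_row, n_shots=4):
    # Prepend n_shots solved dev examples, then append the unsolved test question.
    shots = []
    for row in list(dev_rows)[:n_shots]:
        answer = row[f"choice{row['answer_index']}"]
        shots.append(format_question(row) + " " + answer)
    return "\n\n".join(shots + [format_question(test_row)])

print(build_fewshot_prompt(jamcqa_dev, jamcqa_test[0]))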

Evaluation with FlexEval

FlexEval (v0.13.3+) provides unified evaluation for diverse tasks and metrics:

flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/sarashina2.2-0.5b" \
  --language_model.default_gen_kwargs "{ do_sample: false }" \
  --eval_setup "jamcqa" \
  --save_dir "results/jamcqa"

Configuration:

  • do_sample: false ensures deterministic (greedy) decoding
  • Output includes exact match accuracy and generation statistics
  • Supports batch processing and parallel evaluation

Practical Applications

LLM Benchmarking

Standard Comparison:

  • Quantitative evaluation of Japanese LLMs
  • Fair comparison across model architectures
  • Performance tracking across versions
  • Transfer learning assessment

Model Selection:

  • Identify best model for Japanese applications
  • Validate cultural adaptation effectiveness
  • Compare specialized vs multilingual models
  • Guide deployment decisions

Research Applications

Ablation Studies:

  • Identify domain-specific strengths and weaknesses
  • Evaluate training data impact
  • Test architecture variations
  • Analyze fine-tuning effectiveness

Cross-Lingual Transfer:

  • Assess knowledge transfer from multilingual training
  • Evaluate translation-based approaches
  • Test cultural adaptation strategies
  • Compare monolingual vs multilingual performance

Educational Technology

AI Tutor Development:

  • Validate Japanese knowledge accuracy
  • Test cultural understanding
  • Ensure appropriate content delivery
  • Verify regional awareness

Assessment Systems:

  • Benchmark question generation systems
  • Validate answer evaluation accuracy
  • Test adaptive learning algorithms
  • Ensure cultural appropriateness

Cultural Adaptation

Localization Validation:

  • Verify AI meets local knowledge expectations
  • Test cultural sensitivity
  • Validate regional understanding
  • Ensure appropriate content generation

Benchmark Ecosystem

JamC-QA is part of a growing Japanese LLM evaluation ecosystem:

Complementary Benchmarks:

  • MMLU-ProX (Japanese): Multi-discipline college-level reasoning
  • GPQA (Japanese): Graduate-level science QA
  • JHumanEval: Japanese code generation
  • MATH-100 (Japanese): Competition-level mathematics
  • M-IFEval-Ja: Instruction following control

Benchmark Ecosystem Benefits:

  • Cross-benchmark transfer studies
  • Comprehensive model diagnostics
  • Local relevance validation
  • International comparison baseline

Implementation Best Practices

Evaluation Setup:

  • Use deterministic decoding for reproducibility
  • Report category-level results
  • Include confidence intervals (see the sketch after this list)
  • Document evaluation parameters
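
For the confidence-interval recommendation, one common approach is a nonparametric bootstrap over per-question correctness. A minimal sketch (an illustrative procedure, not one prescribed by JamC-QA):

import random

def bootstrap_accuracy_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a list of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    means = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Example: pass the 2,309 correctness flags produced by the exact-match scorer above.
# print(bootstrap_accuracy_ci(flags))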

Model Preparation:

  • Validate Japanese text processing
  • Test tokenization appropriately (see the sketch after this list)
  • Verify encoding handling
  • Ensure proper formatting
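
For the tokenization point above, a quick round-trip check on Japanese text can surface normalization or encoding problems early. A minimal sketch using the transformers tokenizer for the model named in the FlexEval example (any other checkpoint works the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2.2-0.5b")

# Sample Japanese question text from the example instance above.
sample = "「狂った世で気が狂うなら気は確かだ」の名言を残した映画はどれ?"
decoded = tokenizer.decode(tokenizer.encode(sample, add_special_tokens=False))

# Some tokenizers normalize text, so treat a mismatch as a prompt to inspect
# encoding and normalization rather than as a hard failure.
if decoded != sample:
    print("Round-trip mismatch, check normalization/encoding:", decoded)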

Results Analysis:

  • Compare across categories
  • Identify systematic weaknesses
  • Analyze error patterns
  • Test edge cases

Continuous Improvement:

  • Regular benchmark updates
  • Track performance over time
  • Monitor distribution shifts
  • Validate new model versions
