BERTVision

Parameter-efficient QA and text classification using BERT hidden state activations


Introduction

BERTVision presents a parameter-efficient approach for Question Answering (QA) and text classification that leverages BERT's hidden state activations. By extracting and utilizing the internal transformer layer outputs—typically discarded during inference—our method enables efficient fine-tuning with significantly reduced computational requirements.

The key insight is that BERT's intermediate layer representations contain rich semantic information that can be harnessed for downstream tasks. Rather than fine-tuning the entire 110M+ parameter model, BERTVision trains lightweight adapter networks on pre-extracted embeddings, achieving competitive performance while using only a fraction of the trainable parameters.

Key Contributions

  • Parameter Efficiency: Achieves comparable results to full fine-tuning while training only ~0.5% of the original model parameters
  • Embedding Reuse: Pre-extracted hidden states can be cached and reused across multiple experiments, dramatically reducing training time
  • Accessibility: Enables NLP research on consumer hardware without high-end GPUs
  • Comprehensive Evaluation: Tested across 9 GLUE benchmark tasks and SQuAD 2.0

This approach is particularly valuable for researchers and practitioners working with limited computational resources, or when rapid experimentation across multiple tasks is required.

How It Works

BERTVision operates in two phases: first, we extract hidden state embeddings from all transformer layers during a single forward pass through BERT. Then, we train lightweight adapter networks on these cached embeddings to perform downstream tasks.
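To make phase one concrete, here is a minimal extraction sketch. It assumes the Hugging Face transformers API; the repository's actual extraction scripts may differ in structure and naming.

import torch
from transformers import BertModel, BertTokenizerFast

# Load BERT-base with hidden state output enabled for every layer.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer(
    "When was BERT released?",     # question
    "BERT was released in 2018.",  # context
    return_tensors="pt",
    max_length=384,
    padding="max_length",
    truncation=True,
)

# A single forward pass yields every layer's activations at once.
with torch.no_grad():
    outputs = model(**inputs)

# For BERT-base, hidden_states holds 13 tensors (the embedding output plus
# one per transformer layer), each of shape (batch, seq_len, 768).
all_layers = torch.stack(outputs.hidden_states, dim=1)  # (batch, 13, seq_len, 768)

These stacked tensors are what gets cached to disk in phase one; the adapters trained in phase two never run BERT again.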

Data Pipeline & Span Annotation

For question answering tasks like SQuAD, the data pipeline processes context-question pairs and annotates answer spans. The embeddings capture contextual relationships between the question and passage, enabling the adapter to learn span boundaries efficiently.

Figure 1: Data pipeline for SQuAD 2.0, showing tokenization, span annotation, and embedding extraction
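To illustrate the span annotation step, the sketch below maps a character-level answer span to token-level start and end positions using a fast tokenizer's offset mapping. This is an illustrative reconstruction, not the repository's exact pipeline code.

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

question = "When was BERT released?"
context = "BERT was released by Google in 2018."
answer_text = "2018"
answer_char_start = context.find(answer_text)
answer_char_end = answer_char_start + len(answer_text)

enc = tokenizer(question, context, return_offsets_mapping=True,
                truncation=True, max_length=384)
seq_ids = enc.sequence_ids()  # None for special tokens, 0 = question, 1 = context

start_tok = end_tok = 0  # (0, 0) points at [CLS], the SQuAD 2.0 "no answer" convention
for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
    if seq_ids[i] != 1:  # only context tokens carry usable character offsets
        continue
    if tok_start <= answer_char_start < tok_end:
        start_tok = i
    if tok_start < answer_char_end <= tok_end:
        end_tok = i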

Model Architecture

The BERTVision QA model architecture uses an AdapterPooler that processes hidden states from multiple BERT layers. Rather than only using the final layer output, we leverage representations from all 12 (or 24 for BERT-large) transformer layers, capturing both low-level syntactic and high-level semantic features.

Figure 2: AdapterPooler architecture for question answering, showing multi-layer hidden state processing
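The actual adapter is a multi-head attention pooler with learned layer weights (see Technical Details below). The sketch that follows simplifies it to a learned softmax mixture over layers plus a small task head, purely to show the shape of the computation; every name here is illustrative.

import torch
import torch.nn as nn

class LayerWeightedAdapter(nn.Module):
    def __init__(self, num_layers=13, hidden=768, num_labels=2):
        super().__init__()
        # One learnable weight per layer; softmax turns them into a mixture.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(hidden, num_labels)  # lightweight task head

    def forward(self, embeds):
        # embeds: cached hidden states, (batch, num_layers, seq_len, hidden)
        w = torch.softmax(self.layer_logits, dim=0)
        mixed = torch.einsum("l,blsh->bsh", w, embeds)  # (batch, seq_len, hidden)
        return self.head(mixed)  # per-token logits, e.g. span start/end scores

adapter = LayerWeightedAdapter()
dummy = torch.randn(2, 13, 384, 768)  # stands in for a batch of cached embeddings
logits = adapter(dummy)               # (2, 384, 2)

Note the parameter count: the mixture weights plus a 768-by-2 head come to well under two thousand trainable parameters, which is how this style of adapter stays at a tiny fraction of BERT's 110M+.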

Development & Training Pipeline

The development pipeline shows the complete workflow from raw text to trained adapter models. Embeddings are extracted once and stored as HDF5 files, then reused across multiple training experiments with different hyperparameters, dramatically reducing overall compute time.

Figure 3: End-to-end development pipeline from data preparation to model evaluation
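The cache itself is straightforward to reproduce with h5py. The dataset name and chunking below are illustrative assumptions, not the repository's actual schema.

import h5py
import numpy as np

# Placeholder for embeddings extracted in phase one: 4 examples, 13 layers.
embeds = np.random.randn(4, 13, 384, 768).astype(np.float32)

# Write once, with gzip compression and one example per chunk for fast row reads.
with h5py.File("squad_train_embeds.h5", "w") as f:
    f.create_dataset("embeddings", data=embeds,
                     compression="gzip", chunks=(1, 13, 384, 768))

# Reuse across experiments without ever re-running BERT.
with h5py.File("squad_train_embeds.h5", "r") as f:
    batch = f["embeddings"][:2]  # lazily loads just this slice from disk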

Technical Details

  • Embedding Dimensions: Each token produces a 768-dim (base) or 1024-dim (large) vector per layer
  • Storage Format: HDF5 files with efficient compression for fast I/O
  • Adapter Architecture: Multi-head attention pooler with learned layer weights
  • Training: AdamW optimizer with linear warmup and cosine decay (a minimal setup sketch follows this list)
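As a minimal sketch of that training recipe, assuming Hugging Face's scheduler helper and illustrative hyperparameters:

import torch
from transformers import get_cosine_schedule_with_warmup

adapter = torch.nn.Linear(768, 2)  # stand-in for the adapter network
optimizer = torch.optim.AdamW(adapter.parameters(), lr=2e-5, weight_decay=0.01)

num_training_steps = 1000
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup
    num_training_steps=num_training_steps,           # then cosine decay
)

for step in range(num_training_steps):
    loss = adapter(torch.randn(16, 768)).sum()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()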

NLP Tasks & Data

BERTVision was evaluated on two major NLP benchmarks: SQuAD 2.0 for extractive question answering and the GLUE benchmark for text classification and natural language inference tasks.

SQuAD 2.0

The Stanford Question Answering Dataset 2.0 is a reading comprehension benchmark where models must extract answer spans from passages or determine that a question is unanswerable. SQuAD 2.0 contains over 150,000 questions, including 50,000 unanswerable ones that require models to abstain rather than guess, drawn from more than 500 Wikipedia articles.


GLUE Benchmark

The General Language Understanding Evaluation benchmark is a collection of nine diverse natural language understanding tasks, including sentiment analysis, textual similarity, paraphrase detection, and natural language inference.

Task    Full Name                               Samples   Type
CoLA    Corpus of Linguistic Acceptability      8.5k      Acceptability
SST-2   Stanford Sentiment Treebank             67k       Sentiment
MRPC    Microsoft Research Paraphrase Corpus    3.7k      Paraphrase
STS-B   Semantic Textual Similarity Benchmark   5.7k      Similarity
QQP     Quora Question Pairs                    364k      Paraphrase
MNLI    Multi-Genre Natural Language Inference  393k      NLI
QNLI    Question Natural Language Inference     105k      NLI
RTE     Recognizing Textual Entailment          2.5k      NLI

Results

BERTVision achieves competitive performance across all evaluated tasks while using significantly fewer trainable parameters. In several cases, the parameter-efficient approach matches or exceeds full fine-tuning results, demonstrating that BERT's hidden representations contain sufficient information for effective task adaptation.

Figure 4: Performance comparison showing BERTVision (blue) vs BERT baseline (gray) across benchmark tasks

Detailed Results by Task

Detailed metrics for each benchmark category and task are given below. Each result includes the exact commands needed to replicate the experiments.

Corpus of Linguistic Acceptability


CoLA is a linguistic acceptability task where the goal is to predict whether an English sentence is grammatically acceptable.

Matthews correlation scores across the evaluated model configurations: 53.37%, 20.67%, 60.00%, 43.20%
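CoLA is scored with Matthews correlation rather than plain accuracy because its label distribution is skewed. A quick scikit-learn example of the metric, with made-up labels:

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.5, i.e. 50.00% in the units above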

Hyperparameter Configuration

Optimal hyperparameters were identified through a systematic search using Hyperopt. The table below shows the recommended settings for each task category; a minimal search sketch follows it.

Task Category       Learning Rate   Batch Size   Max Seq Length   Num Labels
MNLI / QQP / QNLI   1e-5            32           128              2-3
RTE                 2e-5            16           250              2
SST / MRPC / CoLA   2e-5            16-32        128              2
STS-B               2e-5            16           128              1
SQuAD 2.0           2e-5            8-16         384              -
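A minimal sketch of that search, assuming Hyperopt's TPE interface; the space mirrors the table above and the objective is a stub standing in for a full training run:

from hyperopt import fmin, tpe, hp, Trials

space = {
    "lr": hp.choice("lr", [1e-5, 2e-5, 3e-5]),
    "batch_size": hp.choice("batch_size", [8, 16, 32]),
}

def objective(params):
    # In the real search this would train an adapter with `params`
    # and return 1 - dev_metric; stubbed here to stay self-contained.
    return 1.0 - (0.5 + 0.1 * (params["lr"] == 2e-5))

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=20, trials=trials)
print(best)  # indices into the hp.choice lists for the best configuration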

Replication

All experiments can be replicated using the code in our GitHub repository. The following sections provide the exact commands for generating embeddings and training models for each task.

Prerequisites: Clone the repository and set up the environment as described in the README. All commands should be run from the code/torch directory.

Embedding Generation

Before training adapter models, you must first generate and cache the BERT embeddings. Run these commands from code/torch/gen_embeds/.

Model Training

Train BERT baseline and BERTVision adapter models. Run from code/torch/.

Error Analysis

Generate detailed error analysis with predicted vs. actual labels.

Research Paper

For detailed methodology, experimental setup, ablation studies, and comprehensive analysis, please refer to our research paper. The paper includes additional experiments on layer-wise contribution analysis and parameter freezing strategies.

BERTVision: Parameter-Efficient Transfer Learning for NLP

Jiang, S.; Benge, C.; King, W. (2020)

We present BERTVision, a parameter-efficient approach to transfer learning for natural language processing tasks. By leveraging hidden state activations from pre-trained BERT models, we achieve competitive performance on GLUE and SQuAD benchmarks while training only a fraction of the parameters required for full fine-tuning.


Citation

@article{bertvision2020,
  title={BERTVision: Parameter-Efficient Transfer Learning for NLP},
  author={Jiang, Stone and Benge, Cristopher and King, William Casey},
  year={2020},
  url={https://cbenge509.github.io/BERTVision/}
}