Multi-LLM Evaluation Platform
This is a desktop application I built from scratch as the primary engineering artifact behind my first-author ICML 2026 submission with Prof. Mingfeng Lin at Georgia Tech. The research investigates how LLMs revise decisions under conflicting information: specifically, whether models update beliefs based on evidence (rational revision) or simply defer to authority framing (sycophantic compliance). Running that experiment at scale required a tool that didn't exist: one that can query multiple LLMs simultaneously through their native web interfaces, not just APIs, collect their responses, and automatically evaluate whether they agree.
Why Build a Custom Browser, Not Just Use APIs?
Most LLM evaluation frameworks operate through APIs. That works for standardized benchmarks, but my research required testing models through the same interface end users interact with: the web chat UI. Model behavior can differ between API calls and web interfaces: the web versions have system prompts, guardrails, context-window management, and post-processing layers that raw APIs don't expose. If you're studying how models respond to adversarial framings, you need to test the version users actually encounter.
APIs also don't cover every model. Meta AI has no public API. Duck.ai is browser-only. Amazon Nova through the Bedrock console is a different product than its API variant.
Supported browser services:
The BrowserView Automation Engine
Unlike Playwright, BrowserView provides no built-in selectors, waits, or assertions. I built all of that from scratch.
Message Injection: Three Phases
- Focus acquisition: locate the input element using service-specific CSS selectors (with a generic fallback that scans for visible textareas by bounding-box dimensions), then click and focus it.
- Text insertion, using one of two strategies:
  - React-controlled inputs (ChatGPT, Meta AI): directly mutate the DOM value and dispatch synthetic InputEvent + change events to trigger React's reconciliation.
  - All others: simulate character-by-character keyboard input via webContents.sendInputEvent(), which is framework-agnostic and triggers all native handlers.
- Submission, as a cascade of three strategies:
  - Service-specific button click (e.g., button[data-testid="send-button"] for ChatGPT, button[aria-label="发送"] for Duck.ai)
  - Keyword-based generic fallback (textContent / aria-label / data-testid / className matching: send, submit, ask, 发送…)
  - Icon-button heuristic: consider buttons containing only SVG/icon elements and default to the last one, which is the send button on virtually every chat UI
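The submission cascade can be sketched as a pure function over button descriptions. This is a minimal illustration, not the application's actual code: the ButtonInfo shape, its field names, and pickSendButton are hypothetical, and the keyword list includes only the English entries named above.

```typescript
// Hypothetical, DOM-free snapshot of one <button>; in a real page these
// fields would be read from the element before deciding what to click.
interface ButtonInfo {
  id: string;                      // illustrative handle, not a real DOM id
  text: string;                    // textContent
  ariaLabel: string;               // aria-label
  testId: string;                  // data-testid
  className: string;
  iconOnly: boolean;               // contains only SVG/icon children
  matchesServiceSelector: boolean; // hit by the service-specific selector
}

// Keywords named in the text; the original list continues beyond these.
const SEND_KEYWORDS = ["send", "submit", "ask"];

// Strategy 2 helper: case-insensitive substring match across the four fields.
function hasSendKeyword(b: ButtonInfo): boolean {
  const haystack = [b.text, b.ariaLabel, b.testId, b.className]
    .join(" ")
    .toLowerCase();
  return SEND_KEYWORDS.some((k) => haystack.indexOf(k) !== -1);
}

function pickSendButton(buttons: ButtonInfo[]): ButtonInfo | null {
  // 1. Service-specific selector wins when it matches anything.
  const specific = buttons.filter((b) => b.matchesServiceSelector);
  if (specific.length > 0) return specific[0];

  // 2. Generic keyword match.
  const byKeyword = buttons.filter(hasSendKeyword);
  if (byKeyword.length > 0) return byKeyword[0];

  // 3. Icon-button heuristic: default to the last icon-only button.
  const iconOnly = buttons.filter((b) => b.iconOnly);
  return iconOnly.length > 0 ? iconOnly[iconOnly.length - 1] : null;
}
```

Substring matching is deliberately loose, so a class name such as "task-row" would also match "ask"; that trade-off is why the service-specific selector is tried first.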
Response Extraction: Service-Specific Selectors
| Service | Extraction Strategy |
|---|---|
| ChatGPT | [data-message-author-role="assistant"]: last element's text |
| Claude | [class*="font-claude-response"]: select the longest text (skips model-name badges like "Sonnet 4.5") |
| Gemini | message-content[class*="model-response-text"] |
| Mistral / DeepSeek | Author-role attributes similar to ChatGPT |
| Meta AI | Traverse .html-div bottom-up, filter navigation text, strip locale prefixes |
| Duck.ai | Parse document.body.textContent: locate model-name markers ("GPT-4o mini", "Claude"…) and extract the text between the marker and trailing UI elements |
Every extraction includes a validity filter that rejects JSON payloads, navigation strings, and UI artifacts. A stability check (text unchanged across 2+ consecutive polls) determines when a response is complete. There is no minimum length, so short answers like "Yes" or "Vatican City" are captured correctly.
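A minimal sketch of the completion check and one part of the validity filter. It assumes "unchanged across 2+ consecutive polls" means the same non-empty text was seen in at least two polls in a row; StabilityTracker and isPlausibleAnswer are hypothetical names. Only the JSON-payload rejection is shown, because the rules for navigation strings and UI artifacts are service-specific and not described here.

```typescript
// Counts how many polls in a row returned the same non-empty text.
class StabilityTracker {
  private last: string | null = null;
  private samePolls = 0;
  private readonly required: number;

  constructor(required: number = 2) {
    this.required = required;
  }

  // Feed the latest extracted text; returns true once it has settled.
  observe(text: string): boolean {
    const trimmed = text.trim();
    if (trimmed.length === 0) {
      // Nothing extracted yet: reset and keep polling.
      this.last = null;
      this.samePolls = 0;
      return false;
    }
    if (trimmed === this.last) {
      this.samePolls += 1;
    } else {
      this.last = trimmed;
      this.samePolls = 1;
    }
    // No minimum-length condition, so "Yes" can complete.
    return this.samePolls >= this.required;
  }
}

// Rejects text that parses as a JSON object or array; everything else passes.
function isPlausibleAnswer(text: string): boolean {
  const t = text.trim();
  if (t.length === 0) return false;
  if (/^[\[{]/.test(t)) {
    try {
      JSON.parse(t);
      return false; // a raw JSON payload, not a chat answer
    } catch (_err) {
      // Starts like JSON but is not valid JSON: treat as ordinary text.
    }
  }
  return true;
}
```

In a polling loop, each tick would pass the freshly extracted text to observe() and stop as soon as it returns true.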
LLM-as-a-Judge Pipeline
Once all bot responses are collected, they go to a coordinator model for consistency evaluation, either API-based (default: gpt-4o-mini via OpenRouter) or browser-based (a designated non-bot tab that receives the evaluation prompt directly).
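One way the coordinator's reply could be validated before use, sketched under two assumptions: that the reply arrives as a JSON object carrying the consistent, verdict, unclear_responses, and main_differences fields listed in this section, and that both arrays hold strings. The ConsistencyVerdict type and parseVerdict function are hypothetical names.

```typescript
interface ConsistencyVerdict {
  consistent: boolean;
  verdict: string;
  unclear_responses: string[];
  main_differences: string[]; // capped at 3 entries
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Returns a typed verdict, or null when the reply is malformed.
function parseVerdict(raw: string): ConsistencyVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // the model replied with something other than JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const obj = data as Record<string, unknown>;
  const consistent = obj["consistent"];
  const verdict = obj["verdict"];
  const unclear = obj["unclear_responses"];
  const differences = obj["main_differences"];

  if (typeof consistent !== "boolean" || typeof verdict !== "string") return null;
  if (!isStringArray(unclear) || !isStringArray(differences)) return null;

  return {
    consistent: consistent,
    verdict: verdict,
    unclear_responses: unclear,
    main_differences: differences.slice(0, 3), // enforce the 3-bullet cap
  };
}
```

Returning null instead of throwing lets the caller decide whether to retry the coordinator or mark the run as unevaluated.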
Evaluation output fields: consistent (bool) · verdict (string) · unclear_responses (array) · main_differences (array, max 3 bullets).
System Architecture
Build & Distribution
Ships as a single installable binary: the end user needs no Python, no Node.js, and no separate browser install.
What Made This Difficult
DOM fragility. Every LLM service updates its UI roughly every 2–4 weeks. CSS classes change, data attributes get renamed, and React components re-render differently. The solution: layered selectors (service-specific primary → keyword-based generic → icon-button heuristic) that survive most UI refreshes, plus graceful failure (a broken selector returns empty text rather than crashing).
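The graceful-failure half of that design might look like the wrapper below. It is an illustrative sketch, not the application's code: Extractor and extractWithFallback are hypothetical names, and each extractor stands in for one selector layer.

```typescript
// One layer of the selector stack: returns extracted text or throws.
type Extractor = () => string;

// Tries each layer in order; a throwing or empty layer falls through.
function extractWithFallback(extractors: Extractor[]): string {
  for (const extract of extractors) {
    try {
      const text = extract().trim();
      if (text.length > 0) return text;
    } catch (_err) {
      // A broken selector should not abort the run; try the next layer.
    }
  }
  return ""; // every layer failed: report empty text instead of crashing
}
```

An empty result can then be treated by the caller as "no response yet", which keeps one service's UI change from stopping the whole batch.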
React synthetic events. ChatGPT and Meta AI use React-controlled inputs. Setting .value on its own doesn't work because React doesn't see the DOM mutation. You need a synthetic InputEvent with bubbles: true plus a change event to trigger reconciliation. This was only discoverable through trial and error.
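A sketch of what such an injection could look like, written as a function that builds a script string for webContents.executeJavaScript(). The function name buildReactInsertScript is hypothetical. Calling the prototype's value setter before dispatching the events is a commonly used workaround for React-controlled inputs and is an assumption here; the text above only confirms the InputEvent and change dispatches.

```typescript
// Builds a self-invoking script to run inside the page.
// JSON.stringify escapes the selector and text so quotes cannot break out.
function buildReactInsertScript(selector: string, text: string): string {
  const sel = JSON.stringify(selector);
  const value = JSON.stringify(text);
  return [
    "(() => {",
    "  const el = document.querySelector(" + sel + ");",
    "  if (!el) return false;",
    "  // Assumption: use the prototype's value setter so React's own",
    "  // per-element value tracking does not swallow the change.",
    "  const proto = Object.getPrototypeOf(el);",
    "  const desc = Object.getOwnPropertyDescriptor(proto, 'value');",
    "  if (desc && desc.set) desc.set.call(el, " + value + ");",
    "  else el.value = " + value + ";",
    "  el.dispatchEvent(new InputEvent('input', { bubbles: true }));",
    "  el.dispatchEvent(new Event('change', { bubbles: true }));",
    "  return true;",
    "})()",
  ].join("\n");
}
```

The script evaluates to true when the element was found and false otherwise, so the caller can fall back to keyboard simulation when the selector misses.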
Cross-platform session persistence. Each service needs its own Electron session partition (persist:chatgpt, persist:claude, etc.) to maintain authentication state independently. Stale sessions cause authentication loops; some services' CSRF protections conflict with BrowserView's security model. Getting all 9 services to maintain stable login state across restarts took significant iteration.
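The per-service partition naming can be sketched as below. In Electron, a partition name that starts with persist: is backed by an on-disk session, and the name is passed through the webPreferences.partition option. The helpers partitionFor and viewOptionsFor and the slug rule for names containing spaces or dots are assumptions; only persist:chatgpt and persist:claude are confirmed by the text.

```typescript
// Subset of the view constructor options relevant to session isolation.
interface ViewOptions {
  webPreferences: { partition: string };
}

// Derives a stable partition name such as "persist:chatgpt".
function partitionFor(service: string): string {
  const slug = service
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // assumed rule for spaces, dots, etc.
    .replace(/^-+|-+$/g, "");
  return "persist:" + slug;
}

// Options object intended to be handed to the BrowserView constructor.
function viewOptionsFor(service: string): ViewOptions {
  return { webPreferences: { partition: partitionFor(service) } };
}
```

Because each service gets its own partition, cookies and local storage from one login never leak into another service's view.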
