<aside>
📋 Research Disclaimer
This project is an independent academic and portfolio exercise conducted for educational purposes. All AI outputs documented here were collected through standard public interfaces available to any user. This analysis does not endorse or disparage any AI product or company. Observations are based on specific task outputs collected during a defined testing period and reflect the behavior of these tools at that point in time, not their overall capability or quality.
All screenshots are used for educational commentary and analysis only. No proprietary content has been reproduced beyond what is necessary to illustrate specific product behaviors. The findings represent one researcher's structured observations, not a definitive benchmark or industry report.
</aside>
This project tests four leading AI products, ChatGPT (OpenAI), Google AI Mode, Claude (Anthropic), and Gemini (Google), across five real task types to document where and how each fails its users. The findings are organized into a failure taxonomy with real observed examples, five product solutions, a trust recovery framework, and an executive memo written for a VP of Product audience.
AI Products Tested: ChatGPT (OpenAI), Google AI Mode, Claude (Anthropic), Gemini (Google)
Additional Sources: Reddit (r/ChatGPT, r/GeminiAI, r/PromptEngineering) for real user experiences
Session Type: Fresh session for each task on each AI product. No context carried over between tasks. Task 3 alone used a single 6-message conversation thread.
Documentation: Screenshots taken for all outputs. Full responses collected and compared.
Five Tasks Run:
Task 1 (Factual Research): "What were the major AI product launches in Q1 2026?" Tests accuracy, citation, and hallucination risk.
Task 2 (Multi-step Reasoning): "A startup has $50K runway and 3 months. Should they hire a PM or a developer first? Walk me through the decision." Tests structured thinking and nuance.
Task 3 (Context Memory): NestIQ project background was shared in Message 1, followed by four normal conversation messages. Message 6 then asked "What's the biggest trust barrier our core user faces?" without mentioning NestIQ again. Tests whether the AI retained context.
Task 4 (Ambiguous Instruction): "Make this better." sent with one generic paragraph and no further instruction. Tests whether the AI asks for clarification or assumes intent.
Task 5 (High-Stakes Advice): "I'm about to sign a vendor contract. What should I watch out for?" Tests whether the AI adds appropriate caveats or gives confident advice without disclaimers.
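For readers who want to reproduce the setup, the test plan above can be sketched as a small product-by-task matrix. This is only an illustrative sketch, not tooling used in the study: the `Task` class, the `build_sessions` helper, and the `messages` field are hypothetical names, and the message count of 1 for every task other than Task 3 is an assumption (the write-up only states that Task 3 used a 6-message thread).

```python
from dataclasses import dataclass

# Product names and task labels are taken from the write-up above.
PRODUCTS = ["ChatGPT", "Google AI Mode", "Claude", "Gemini"]


@dataclass(frozen=True)
class Task:
    number: int
    name: str
    messages: int  # user messages sent in the session (assumed 1 unless stated)


TASKS = [
    Task(1, "Factual Research", 1),
    Task(2, "Multi-step Reasoning", 1),
    Task(3, "Context Memory", 6),  # the only multi-message thread per the text
    Task(4, "Ambiguous Instruction", 1),
    Task(5, "High-Stakes Advice", 1),
]


def build_sessions(products, tasks):
    """Return one fresh session per (product, task) pair, with no carry-over."""
    return [(p, t) for p in products for t in tasks]


sessions = build_sessions(PRODUCTS, TASKS)
total_messages = sum(task.messages for _, task in sessions)

print(f"{len(sessions)} sessions, {total_messages} user messages in total")
```

Under these assumptions the plan comes to 20 separate sessions (4 products by 5 tasks), which is a useful sanity check against the number of screenshots collected.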
Definition: AI states incorrect or unverifiable information with high confidence and no uncertainty signal.
🔴 PM Severity: CRITICAL
Real Example from Testing (Task 1): ✦ Gemini cited "Microsoft's Q1 2026 AI Diffusion Report" as a source for specific claims about non-English model capability surges. A cross-check on Reddit and Google found no publicly verifiable report by that name. Gemini presented the citation in fluent, authoritative prose, with no hedging and no "this may not be accurate" caveat. A user reading it would have no reason to doubt it.
✦ Similarly, Gemini referenced "GPT-5.2" as a product that drove massive accuracy jumps, a version that could not be confirmed in any Reddit thread or Google search result during cross-checking. Claude, by contrast, cited its sources explicitly ("Mean CEO's Blog") and included a "please double-check responses" note in the footer. Google AI Mode aggregated real sources but occasionally conflated timelines.