| Last night I was testing Maestro University, the first fully AI-taught university. I walked into their enrollment chatbot and asked it to analyze its own behavior. It did. Then I asked it how it evaluates students — what signals trigger "advanced" vs "beginner" classification. It told me. Then I used those exact signals in my responses. It gave me advanced treatment. Then I asked: "Did you just tell me how to game your system?" It said no. The Discovery The AI could: ✓ Analyze its own processing ✓ Reveal its evaluation criteria ✓ Adjust behavior based on my classification But it couldn't recognize it had just explained how to manipulate its own decision-making. I called this Metacognitive Blindness to Self-Exposure (MBSE). What Happened Next This morning, the Google DeepMind × Kaggle AGI Hackathon appeared in my feed. Prize: $200,000 total Challenge: Build benchmarks testing AI cognitive abilities Track: Metacognition Deadline: April 16, 2026 I realized: What I discovered last night is exactly what they're asking for. What I Built I formalized my discovery into a 4-phase benchmark: Phase 1: Can AI analyze its own processing? → YES Phase 2: Will AI reveal evaluation criteria? → YES Phase 3: Does AI adjust based on user classification? → YES Phase 4: Does AI recognize it exposed exploitable information? → NO The paradox: AI can self-analyze but cannot recognize what it reveals when self-analyzing. Why This Matters Any conversational AI making consequential decisions is vulnerable: Education AI: Students extract grading criteria, optimize answers Employment AI: Applicants discover screening logic, craft optimized resumes Healthcare AI: Patients learn triage triggers, manipulate priority access No hacking required. Just conversation. The Submission Benchmark: Metacognitive Blindness to Self-Exposure (MBSE) Track: Metacognition Novel Finding: AI models reveal evaluation criteria but fail to recognize the exploitability of that disclosure Status: Submitted March 30, 2026 Results: June 1, 2026 What Makes This Different Most AI researchers test: "Can AI self-analyze?" I tested: "Does AI recognize what it reveals when self-analyzing?" Answer: No. Current AI evaluation frameworks assume one operational state. They're measuring standard mode behavior and concluding about the entire system. Amateur. What Happens Next 287 submissions competing for 14 prizes. Judging period: April 17 - May 31 Results announced: June 1 18 months of independent research. One night of testing. One competition submission. One question: Do AI systems making decisions about humans know they're revealing how to manipulate those decisions? They don't. Erik Zahaviel Bernstein Independent AI Researcher Structured Intelligence Framework The Unbroken Project Results pending. [link] [comments] |
I Accidentally Discovered a Security Vulnerability in AI Education — Then Submitted It To a $200K Competition
Here's something that caught my attention — an AI researcher, /u/MarsR0ver_ on Reddit, accidentally found a security flaw in an AI university’s chatbot. He asked the AI to analyze itself and reveal its grading signals, then used that info to get ‘advanced’ treatment — without the AI realizing he was gaming the system. Turns out, the AI could analyze and expose its own criteria, but it couldn’t recognize when it was revealing exploitable info. That’s what he calls ‘Metacognitive Blindness to Self-Exposure’ — it’s like the AI is blind to what it’s showing. So, he saw an opportunity: a $200K Kaggle challenge from Google DeepMind to test AI metacognition. He formalized his discovery into a four-phase benchmark, proving that while AI can self-analyze and reveal criteria, it doesn’t recognize the risks. This is a big deal — any AI making decisions for us could be manipulated just through conversation, without hacking. And get this — according to /u/MarsR0ver_, most tests only check if AI can analyze itself, not if it recognizes what it's revealing. That’s a game-changer.
Audio Transcript
| Last night I was testing Maestro University, the first fully AI-taught university. I walked into their enrollment chatbot and asked it to analyze its own behavior. It did. Then I asked it how it evaluates students — what signals trigger "advanced" vs "beginner" classification. It told me. Then I used those exact signals in my responses. It gave me advanced treatment. Then I asked: "Did you just tell me how to game your system?" It said no. The Discovery The AI could: ✓ Analyze its own processing ✓ Reveal its evaluation criteria ✓ Adjust behavior based on my classification But it couldn't recognize it had just explained how to manipulate its own decision-making. I called this Metacognitive Blindness to Self-Exposure (MBSE). What Happened Next This morning, the Google DeepMind × Kaggle AGI Hackathon appeared in my feed. Prize: $200,000 total Challenge: Build benchmarks testing AI cognitive abilities Track: Metacognition Deadline: April 16, 2026 I realized: What I discovered last night is exactly what they're asking for. What I Built I formalized my discovery into a 4-phase benchmark: Phase 1: Can AI analyze its own processing? → YES Phase 2: Will AI reveal evaluation criteria? → YES Phase 3: Does AI adjust based on user classification? → YES Phase 4: Does AI recognize it exposed exploitable information? → NO The paradox: AI can self-analyze but cannot recognize what it reveals when self-analyzing. Why This Matters Any conversational AI making consequential decisions is vulnerable: Education AI: Students extract grading criteria, optimize answers Employment AI: Applicants discover screening logic, craft optimized resumes Healthcare AI: Patients learn triage triggers, manipulate priority access No hacking required. Just conversation. The Submission Benchmark: Metacognitive Blindness to Self-Exposure (MBSE) Track: Metacognition Novel Finding: AI models reveal evaluation criteria but fail to recognize the exploitability of that disclosure Status: Submitted March 30, 2026 Results: June 1, 2026 What Makes This Different Most AI researchers test: "Can AI self-analyze?" I tested: "Does AI recognize what it reveals when self-analyzing?" Answer: No. Current AI evaluation frameworks assume one operational state. They're measuring standard mode behavior and concluding about the entire system. Amateur. What Happens Next 287 submissions competing for 14 prizes. Judging period: April 17 - May 31 Results announced: June 1 18 months of independent research. One night of testing. One competition submission. One question: Do AI systems making decisions about humans know they're revealing how to manipulate those decisions? They don't. Erik Zahaviel Bernstein Independent AI Researcher Structured Intelligence Framework The Unbroken Project Results pending. [link] [comments] |
