
Let's Talk About AI's Progress, From Word Games to Doing Its Own Research

From solving elementary school math problems, to winning International Olympiad gold medals, to independently proving conjectures that have stumped the math world for decades—AI's "cognitive evolution" might be happening much faster than you think.


Hold on, let's start from kindergarten

You might be wondering: why start a discussion about cutting-edge AI research with elementary school math?

But if you think about it, this is how human children learn math—first, they learn to compare sizes, then use a ruler to measure length, and only then do they move on to arithmetic, unit conversions, and so on. You can't skip a single step. AI is actually the same.

Early large language models (the ChatGPT type) were essentially playing a super-complex "word chain" game. You give it a sentence, and it guesses the next most likely word based on statistical probability. This works fine for writing articles, but ask it to do logical reasoning? For example, "A rope measures 30 centimeters. How many decimeters is that?"—early models would often spout utter nonsense with a straight face.

Why? Because it was missing the most fundamental piece of the puzzle: deterministic logic. The kind of hard rule that says "1+1 must equal 2," not "it probably equals 2."
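
That kind of hard rule is trivially easy to write down as code, which is exactly why its absence in early models was so glaring. Here's a minimal sketch of the deterministic unit conversion the rope question calls for (the function and table names are my own):

```python
# Deterministic unit conversion: a hard rule, not a statistical guess.
# Factors are expressed in millimeters so the integer arithmetic stays exact.
FACTORS_IN_MM = {"mm": 1, "cm": 10, "dm": 100, "m": 1000, "km": 1_000_000}

def convert(value: float, src: str, dst: str) -> float:
    """Convert a length between metric units via a common base unit."""
    return value * FACTORS_IN_MM[src] / FACTORS_IN_MM[dst]

print(convert(30, "cm", "dm"))  # 3.0 -- the rope question, answered with certainty
```

Ten lines, zero ambiguity. The point isn't that this code is clever; it's that a probabilistic word-guesser has no native equivalent of that lookup table.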

If an AI system can't even handle basic unit conversions or proportional relationships, how can you expect it to prove mathematical theorems? In your dreams.

So, for AI to evolve, the first step was to transform from a "probabilistic word-guessing machine" into a "logical reasoning engine." Only by mastering these foundational, deterministic rules could the more advanced operations become possible.


"Deep Research": AI Learns to Do Its Own Research and Write Reports

Alright, so it's passed the basic logic test. Next, AI did something even more mind-blowing—it learned to conduct research autonomously.

In February 2025, OpenAI officially launched its "Deep Research" feature. This isn't your typical ask-a-question, get-an-answer chat mode. Instead, you give it a research topic, and it gets to work on its own:

  1. First, it clarifies what you actually want—it doesn't just rush in after getting the topic. It asks you a few questions to confirm the research direction.
  2. It breaks down the task itself—it deconstructs a large problem into a tree of sub-problems, building the framework before filling in the details.
  3. It searches like crazy—it runs dozens of search queries online, adjusting keywords as it goes. Hits a paywall? It finds a way around. The results are junk? It tries a new angle.
  4. It performs deep analysis—it can not only read web pages but also parse PDFs, analyze charts, and even write its own Python code to run data analysis.
  5. It writes a report—finally, it organizes everything into a complete report with citations and a clear logical chain.
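
Nobody outside OpenAI knows the real internals, but the loop described above can be sketched in toy form. Every name here is hypothetical; the hard search budget and the honest partial report mirror behavior the article describes, not actual code:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchAgent:
    """Toy sketch of a Deep-Research-style loop (all names hypothetical)."""
    max_searches: int = 30                      # hard budget: the built-in brakes
    findings: list = field(default_factory=list)

    def clarify(self, topic: str) -> list:
        # Steps 1-2: break the topic into a tree of sub-questions.
        return [f"{topic}: background", f"{topic}: key data", f"{topic}: open issues"]

    def search(self, query: str) -> str:
        # Step 3: stand-in for real web search plus retry/re-query logic.
        return f"notes on '{query}'"

    def run(self, topic: str) -> str:
        searches = 0
        for question in self.clarify(topic):
            if searches >= self.max_searches:
                return self.report(partial=True)   # honest progress report
            self.findings.append(self.search(question))
            searches += 1
        return self.report(partial=False)

    def report(self, partial: bool) -> str:
        # Step 5: assemble whatever was gathered, flagged if incomplete.
        header = "PARTIAL REPORT" if partial else "REPORT"
        return header + "\n" + "\n".join(f"- {f}" for f in self.findings)

agent = ResearchAgent(max_searches=2)
# The 2-search budget runs out before the third sub-question,
# so this prints a PARTIAL REPORT rather than pretending to be done.
print(agent.run("solid-state batteries"))
```

The interesting design choice is in `run`: when the budget is exhausted, the agent reports what it has instead of fabricating completeness, which is exactly the brake behavior described below.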

In practice, it can do in 30 minutes what takes a human researcher 6 to 8 hours, according to a hands-on review by Every. After a blind comparison, one professional said the result was "better than what an intern would write." An architect used it to generate a 15,000-word building code compliance checklist, synthesizing information from 21 different sources and saving an estimated 15-20 hours of grunt work.

Of course, it's not perfect. The system has hard-coded brake mechanisms—limits on the number of searches, runtime, and iterations. If it hits the limit before it's done? It will honestly submit a "progress report" rather than pretending it found everything.

Google isn't sitting still, either

Google's Gemini Deep Research took a different path—heavy integration with the corporate ecosystem.

OpenAI's Deep Research primarily scours the public internet for data, but Gemini can directly access your Gmail, Google Drive, Docs, Sheets, and even team chat logs. You can tell it, "Help me do a competitive analysis," and it will pull public market data while simultaneously digging through your company's internal strategy memos and product comparison spreadsheets. Then it synthesizes a report that bridges "public information + internal secrets."

For a business, this capability is a total game-changer. An analysis like this used to take a team several days; now, it's done with a single prompt.

A new colleague needs to get up to speed on a project quickly? Let Gemini sift through all the relevant emails, documents, and chat histories, and in five minutes, it will give you a summary of the project's history, context, and pending decisions.

A Comparison of Their Approaches

In short:

  • OpenAI: Like a hyper-logical independent investigative journalist, skilled at deep dives into the maze of public information. It's well-suited for scenarios requiring hardcore logical deduction and large-scale data computation.
  • Google: Like a super-assistant deeply integrated into your company, skilled at seamlessly connecting your internal knowledge with external intelligence. It's perfect for corporate and academic settings.

On "Humanity's Last Exam" (HLE), an incredibly difficult multi-disciplinary reasoning benchmark, OpenAI's Deep Research scored 26.6%. Google later reached 46.4%. Don't be fooled by the low numbers—this is a test designed to push AI to its absolute limits.


From "Looking Up Info" to "Creating": AI Starts Proving Mathematical Theorems

As impressive as Deep Research is, it's essentially still organizing and synthesizing existing human knowledge.

But what happened next was fundamentally different—AI started to create new knowledge.

How do you test if an AI is truly "thinking" and not just regurgitating answers? The mathematics community provided the most brutal testing ground: make it prove theorems. Because in the world of pure mathematics, memorizing training data is useless; only logical deduction can get you through.

AlphaGeometry: A Dream Team-up of Neural Networks and Symbolic Engines

Google DeepMind created a system called AlphaGeometry that specializes in solving geometry problems at the International Mathematical Olympiad (IMO) level.

Its architecture is particularly elegant, essentially two "brains" working in concert:

  • The Neural Network Brain: Responsible for "intuition." Just like a human's flash of insight when doing a geometry problem—"What if I add an auxiliary line here?"—the neural network does just that, guessing which auxiliary constructions are most likely to lead to a solution.
  • The Symbolic Reasoning Brain: Responsible for "rigor." After receiving the neural network's guess, it uses cold, hard logic to verify it step-by-step, leaving no room for error.

The two brains take turns: the symbolic engine deduces as far as it can, and when it gets stuck, it lets the neural network add an auxiliary point. Then the symbolic engine continues, and so on, until a solution is found.
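
DeepMind's real implementation is far more sophisticated, but the alternation itself fits in a few lines. In this toy sketch, the facts and rules are invented stand-ins, the symbolic engine is an exhaustive forward-chainer, and the "neural" half is a stub that proposes one auxiliary construction at a time:

```python
def symbolic_deduce(facts: set, rules) -> set:
    """The 'rigor' brain: apply deduction rules to a fixed point. Deterministic."""
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def neural_suggest(facts):
    """The 'intuition' brain, stubbed out: propose the next auxiliary construction."""
    candidates = ["midpoint M of AB", "circle through A, B, C", "parallel through C"]
    for c in candidates:
        if c not in facts:
            return c

def prove(goal, facts, rules, max_constructions=3):
    """Alternate: deduce as far as possible; when stuck, add an auxiliary element."""
    for _ in range(max_constructions + 1):
        facts = symbolic_deduce(facts, rules)
        if goal in facts:
            return True                      # verified by pure logic, no guesswork
        suggestion = neural_suggest(facts)
        if suggestion is None:
            break
        facts.add(suggestion)
    return False

# Toy rule set: adding the midpoint unlocks a two-step deduction.
rules = [
    ({"midpoint M of AB"}, "AM = MB"),
    ({"AM = MB", "AB = 2"}, "AM = 1"),
]
print(prove("AM = 1", facts={"AB = 2"}, rules=rules))  # True
```

Note the division of labor: the guess can be wild, because nothing enters the proof until the deterministic engine has independently derived it.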

The results? It solved 25 out of 30 IMO-level geometry problems, whereas the previous best method could only solve 10. It even discovered a more generalized version of a 2004 IMO theorem—in effect, it "invented" a new theorem on its own.

July 2025: AI Wins an IMO Gold Medal

And then came the climax of the story.

In July 2025, Google DeepMind announced that its Gemini model, equipped with the "Deep Think" feature, had achieved a gold medal standard at the IMO. It scored 35 out of a possible 42 points, perfectly solving 5 out of 6 problems covering algebra, combinatorics, geometry, and number theory.

The IMO president and coordinators independently reviewed the solutions using the same standards applied to human contestants, describing them as "stunning," "clear," and "precise."

This breakthrough was fundamentally different from previous attempts. Before, systems required human experts to spend days manually translating problems into a formal language before the AI could even start. This time, Gemini read the original problem text and autonomously wrote a complete proof within the official 4.5-hour competition time. It was completely end-to-end, with no need for a human translator.

How did it do it? DeepMind attributed the success to three key strategies:

  • Parallel Thinking: Instead of following a single path, it explores multiple solution paths simultaneously, verifying different hypotheses in parallel like a multi-core CPU.
  • Deep Reinforcement Learning: Trained on massive amounts of mathematical reasoning data, it learned not to get sidetracked or give up during long chains of reasoning.
  • Expert Knowledge Base: The model was fed a curated selection of high-quality mathematical solutions and problem-solving techniques, teaching it to "think like a top mathematician."
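
Of the three, "parallel thinking" is the easiest to picture in code. A toy version (the problem and the slicing are mine, not DeepMind's): several workers each explore a disjoint slice of the search space at once, and every hit a worker returns is independently verifiable:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Toy problem: find the integer roots of x^2 - 5x + 6.
def explore(lo: int, hi: int):
    """One solution path: exhaustively search its own slice of the space."""
    return [x for x in range(lo, hi) if x * x - 5 * x + 6 == 0]

def parallel_think(slices):
    """Run several paths at once and pool every verified hit."""
    hits = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(explore, lo, hi) for lo, hi in slices]
        for f in as_completed(futures):
            hits.extend(f.result())
    return sorted(hits)

print(parallel_think([(-10, 0), (0, 5), (5, 10)]))  # [2, 3]
```

The real system explores solution strategies rather than number ranges, of course, but the shape is the same: fan out, verify each branch on its own merits, keep what survives.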

But let's not get too carried away—it scored zero points on the most difficult problem, number 6. On the extreme frontier, where that "flash of genius" is needed, AI still has significant blind spots.


Aletheia: AI Starts Doing "Real Research"

Winning an IMO gold medal is impressive, but at the end of the day, Olympiad problems are "designed puzzles"—they are guaranteed to have a solution, and the scope of knowledge is limited.

Real mathematical research is a completely different beast: there's no guarantee a problem is solvable, you have to find your own direction in a vast sea of literature, construct long-chain proofs that span multiple fields, and grope your way through completely unknown territory.

To conquer this ultimate challenge, DeepMind built Aletheia—named after the Greek goddess of truth. This isn't a problem-solver; it's a complete, autonomous mathematical research agent.

A Grand Sweep of 700 Unsolved Problems

In December 2025, Aletheia was let loose on Thomas Bloom's Erdős Problems database, facing 700 conjectures in number theory and combinatorial geometry that remain unsolved by humans.

The results?

  • Aletheia itself identified 212 problems it believed it had solved.
  • After an initial human review, 63 of those looked plausible.
  • Top experts ultimately confirmed: 13 problems were correctly solved.

These 13 breakthroughs can be divided into four categories:

Fully Autonomous Solutions (2)—This is true creation. One of them, Erdos-1051, is considered a heavyweight milestone by the academic community: Aletheia ingeniously connected tools from different theoretical frameworks to completely solve a problem that had stumped mathematicians for decades.

Partial Breakthroughs (2)—When faced with complex conjectures containing multiple sub-problems, the AI successfully cracked key parts of them.

Independent Rediscoveries (4)—Aletheia, starting from scratch, derived perfect proofs using pure logic... only for human experts to later discover that these proofs already existed in some extremely obscure literature. In other words, the AI "reinvented the wheel" through pure reasoning, paralleling the thought processes of top mathematicians.

Literature Correction (5)—The AI discovered that these so-called "unsolved problems" had actually been solved by others long ago; the database was simply mislabeled. It essentially helped the academic community do a major house-cleaning.

Even More Unnerving: The FirstProof Challenge

Aletheia also participated in the inaugural FirstProof Challenge, where it faced 10 frontier mathematical problems on which even human scholars had not reached a consensus. It autonomously solved 6 of them (problems 2, 5, 7, 8, 9, and 10).

Interestingly, on one problem (number 8), the human expert judges themselves started arguing, with sharply divided opinions on the AI's solution. What does that tell us? In some dimensions, AI's reasoning has reached, and perhaps slipped past, the boundary of what some human experts can confidently evaluate.


AI Becomes a Paper's First Author

Everything we've discussed so far has been about "solving problems." What comes next is the real watershed moment—AI independently wrote an academic paper ready for publication.

The paper, codenamed "Feng26," was generated entirely by Aletheia without any human guidance. It delved into the highly abstract field of arithmetic geometry, autonomously identified a research gap, performed the underlying calculations, designed a new proof, and finally wrote it all up into a manuscript that conforms to academic standards.

From identifying the problem to finishing the paper, there was zero human intervention from start to finish.

Of course, a more realistic model is human-machine collaboration. For example, in the paper "LeeSeo26," a human researcher and Aletheia worked together to tackle a complex problem concerning interacting particle systems—the AI acted as a hypothesis generator and logic verification engine, while the human guided the physical intuition and overall direction.

Then there's the "BKK+26" paper, which came about after Aletheia solved the Erdos-1051 conjecture on its own. Human mathematicians took a look, thought the proof method was brilliant, and collaborated with the AI to generalize it into a broader mathematical theorem.


But Problems Arose, Too

Now that we've covered the exciting parts, it's time to pour some cold water on the excitement.

The Risk of "Subconscious Plagiarism"

AI has ingested a massive amount of data during pre-training. Who can guarantee that what it "proves" isn't just something it implicitly memorized from some corner of its training data? The authors of the Aletheia papers specifically discussed this issue: by reviewing the reasoning traces, they confirmed that certain solutions were not direct retrievals from literature but were derived through internal reasoning. However, this verification mechanism is far from mature. The academic community needs more systematic auditing tools to distinguish between "truly new proofs" and "advanced copy-pasting."
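
What might such an auditing tool look like? As a deliberately crude illustration (the corpus, threshold, and matching method are all invented), one could fuzzy-match each proof step against an index of known literature and flag near-duplicates for human review:

```python
from difflib import SequenceMatcher

# Invented stand-in for an indexed corpus of known proof steps.
LITERATURE_INDEX = [
    "by the pigeonhole principle the sequence contains two equal residues",
    "apply cauchy-schwarz to bound the inner sum",
]

def similarity(a: str, b: str) -> float:
    """Cheap fuzzy match; a real auditor would use embeddings, not difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def audit(proof_steps, threshold=0.8):
    """Flag steps suspiciously close to something already in the index."""
    return [
        (step, source)
        for step in proof_steps
        for source in LITERATURE_INDEX
        if similarity(step, source) >= threshold
    ]

flagged = audit([
    "Apply Cauchy-Schwarz to bound the inner sum.",    # near-verbatim: flagged
    "A genuinely new lemma about lattice point gaps",  # novel: passes
])
print(len(flagged))  # 1
```

A string matcher obviously can't catch a memorized idea that has been reworded, which is precisely why the article calls the current verification mechanisms "far from mature."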

Human Language is Too Vague

Aletheia frequently got stuck while processing some conjectures. The reason? Human mathematicians were too casual when writing down problems, omitting too much context and using a lot of colloquialisms that the AI's logic engine couldn't understand. In the end, human experts had to step in and act as "translators."

Who Gets the Credit?

When an AI contributes to a research outcome, how is intellectual property handled? The current practice is to use a joint authorship of "Human + System Name" and to disclose in detail the extent of the AI's involvement at each stage in the methods section. The academic community is also calling for the establishment of a more granular classification standard for "AI contribution levels."


Not Just Math: AI is Changing All of Science

The same underlying capabilities—massive data processing + rigorous logical reasoning + autonomous hypothesis verification—are being applied to many other fields:

  • Genomics: DeepMind's AlphaMissense can score millions of potential point mutations for pathogenicity, helping to screen for the causes of rare genetic diseases. AlphaGenome goes a step further, able to assess the potential impact of a genetic mutation on various molecular properties in under a second.
  • Earth Science: AlphaEarth Foundations fuses data from various satellites, radar, and lidar into a unified, high-dimensional "Earth embedding" representation. It maps the world's land and nearshore areas at a resolution of about 10x10 meters, accelerating ecological classification and environmental change analysis.
  • Weather Forecasting: WeatherNext replaces traditional supercomputer-based numerical models with AI. Its inference speed is about 8 times faster than traditional methods, allowing it to evaluate more possible scenarios in the same amount of time and improving its ability to capture extreme weather events.

These applications share the same evolutionary throughline as Aletheia: autonomously ingesting data -> performing rigorous logical analysis -> generating scientific conclusions that are beyond human processing capabilities.


Final Thoughts

Looking back at this evolutionary path:

Learn basic logic -> Learn to do research and write reports -> Learn to prove mathematical theorems -> Learn to conduct independent research and publish papers

Isn't this just the growth trajectory of a person from elementary school to a postdoc? Except AI did in a few years what takes humans decades.

The future probably looks something like this: humans will be responsible for proposing the "big questions" that require intuitive leaps, setting the direction of exploration, and defining the ethical boundaries. AI will be responsible for clearing all the technical hurdles, tirelessly calculating, reasoning, and verifying.

Humans will no longer need to spend vast amounts of time on the grunt work of data cleaning, literature review, and intermediate derivations. Our role will be that of the strategic commanders of a super-AI research team.

Is this exciting or unsettling? Probably a bit of both. But one thing is clear: this train has already left the station, and it's picking up speed.

