Claude 3 Release: A Comprehensive Test of Six Leading AI Models

Last night, Claude 3 was released, and I ran a quick test of the Sonnet model. Today, after subscribing to Claude Pro, I unlocked the Opus model, which reportedly surpasses GPT-4 across various benchmarks. I initially intended to compare it only with GPT-4 and Gemini, but decided to also include Zhipu GLM4, iFLYTEK Xinghuo, and Baidu Wenxin Yiyan 4.0. I selected eight examples, most of them directly related to productivity, to evaluate these six models.

The responses to each question appear in this order: Claude 3, Gemini, GPT-4, Zhipu GLM4, iFLYTEK Xinghuo, and Wenxin Yiyan 4.0. For various reasons, this public version posts some models' results without scores.

Warm-up: Image Understanding

Question 1

This is an example from Microsoft's comprehensive testing paper on the GPT-4 multimodal model. When GPT-4 Multimodal and Gemini (Bard) were first released, we ran a comparison, and both models were correct. Thus, I thought of this as a warm-up.

The task: Answer the question within the image (requires the model to correctly identify the question and provide the correct answer).

At that time, GPT-4 and Gemini (Bard) responded as follows:

Previous Results

Claude 3: The answer was perfect. 10/10.

Claude 3 Response

Gemini Ultra 1.0: It failed. It identified the question correctly but provided confident nonsense for the answer. We could call this a "hallucination," but per Hinton's terminology, I prefer "confabulation"—confidently wrong. Given the correct question identification, 4/10.

Gemini Response

GPT-4: Also failed. It even identified the question incorrectly, though if answered based on its own misidentification, the logic was correct. 4/10.

GPT-4 Response

Zhipu GLM4: Identified the question incorrectly, and the subsequent answer had nothing to do with the image.

GLM4 Response

iFLYTEK Xinghuo: Clearly misidentified the question, though it correctly mentioned the gray cat.

Xinghuo Response

Baidu Wenxin Yiyan 4.0: I dislike Baidu's interaction design. You have to manually select the "Image Interpretation" plugin and perform the upload and questioning in two separate steps. It was the only Chinese model to identify the question correctly, but the answer was wrong—trailing slightly behind Gemini and GPT-4.

Wenxin Response

Summary: This tests OCR and reasoning. Claude 3 takes full marks. For GPT-4, Gemini, and Wenxin 4, I believe the OCR was actually correct, but the language model was "led astray" by the prompt. Baidu's OCR remains top-tier in China.

Question 2: Counting Apples

In previous tests, GPT-4 was correct, and Gemini was wrong.

Apple Counting Image

Claude 3: Incorrect answer, but showed great imagination in spotting details. 4/10.

Claude 3 Response

Gemini Ultra: Incorrect. It even used search for some reason. 3/10.

Gemini Response

GPT-4: Failed again. Although GPT-4 usually has high accuracy, this time it consistently gave wrong answers (6 or 8). Given accurate question identification, 2/10.

GPT-4 Response

Zhipu GLM4: Correct, concise, and clean.

GLM4 Response

iFLYTEK Xinghuo: Incorrect; insisted on 8 apples.

Xinghuo Response

Wenxin Yiyan 4.0: Correct. A perfect answer that displayed rich "imagination."

Wenxin Response

Summary: Overseas models all failed; domestic models were 2 out of 3 correct.

Intermediate: Document Understanding

Question 3: Text Table Output

The table is from Microsoft's Sora review paper, listing 45 models across 5 columns.

Sora Table

Claude 3: Perfectly correct. One minor flaw: I asked in Chinese, but it returned the table in English. 10/10.

Claude 3 Response

Gemini Ultra: Very impressive. It was accurate and translated the content into Chinese. Highlights: 1. It offered an "Export to Sheet" button. 2. It added a column of accurate links to websites/papers. 10+2/10.

Gemini Response Gemini Link Feature

GPT-4: Flawless and correct, though English only. 10/10.

GPT-4 Response

Zhipu GLM4: Cut off at row 32, returned in English.

GLM4 Response

iFLYTEK Xinghuo: Display was incomplete and contained many alignment errors.

Xinghuo Response

Wenxin Yiyan 4.0: Missed two rows and had alignment issues.

Wenxin Response

Question 4: Financial Statements

A page from a corporate financial report. Harder due to the density of numbers.

Financial Statement

Claude 3: Virtually flawless, except for an "imagined" date. 10/10.

Claude 3 Response

Gemini Ultra: Misidentified the last row and had alignment errors—fatal for financial data. 5/10.

Gemini Response

GPT-4: Handed in a blank sheet. Its Tesseract OCR module crashed. 0/10.

GPT-4 Response

Domestic Models (GLM4, Xinghuo, Wenxin): All struggled with numerous numerical errors, with Wenxin being slightly better than the others.

GLM4 Response Xinghuo Response Wenxin Response

Advanced: Knowledge Extraction

Question 5: Architecture Diagram Recognition

Identify and explain the modules of the Sora architecture.

Sora Architecture

Claude 3: Excellent response, showing true "understanding." 9/10.

Claude 3 Response

Gemini Ultra: Identified correctly but lacked the structural depth of Claude 3. 7/10.

Gemini Response

GPT-4: A solid, high-quality summary. 10/10.

GPT-4 Response

Domestic Models: Mostly correct but too brief; GLM4 misidentified an encoder as a decoder.

Advanced: Simple Programs

Question 6: Scratch Image to Code

Scratch Logic

Claude 3: Correct logic with minor redundant assignments. 9/10.

Claude 3 Response

Gemini Ultra: Provided a literal translation of blocks rather than code. 6/10.

Gemini Response

GPT-4: Perfect score. Clean code and even executed the output correctly. 10/10.

GPT-4 Response

Domestic Models: GLM4 failed on boundary conditions (<=9 vs <9); Wenxin's answer was irrelevant.
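To make the boundary-condition failure concrete (the actual Scratch program is in the image above; this Python sketch is only a hypothetical analogue), note that looping while a counter is `<= 9` runs exactly one more time than looping while it is `< 9`:

```python
# Hypothetical illustration of the <=9 vs <9 off-by-one error:
# the two loop conditions differ by exactly one iteration.

def count_iterations(limit_inclusive: bool) -> int:
    """Count how many times the loop body runs, starting from i = 0."""
    n = 0
    i = 0
    while (i <= 9) if limit_inclusive else (i < 9):
        n += 1
        i += 1
    return n

print(count_iterations(True))   # condition i <= 9: runs 10 times
print(count_iterations(False))  # condition i < 9:  runs 9 times
```

An off-by-one like this is invisible in a casual read of generated code but changes the program's output, which is why GLM4's translation failed the test.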

Advanced: Data Analysis

Question 7: Chart Interpretation

The most difficult of the eight questions.

Chart Analysis

Claude 3: Data points were all wrong. 3/10.

Claude 3 Response

Gemini Ultra: 2/10.

Gemini Response

GPT-4: Dates and numbers were mismatched. 3/10.

GPT-4 Response

Real Productivity: img2code

Question 8: Generate Page Code from Screenshot

Dashboard Image

Claude 3: Commendable risk awareness; once clarified for public data, it generated solid d3.js code. 8/10.

Claude 3 Process

Gemini Ultra: Exceptional. Integrated real search results and data links. 10/10.

GPT-4: High code quality, though slightly less capable in search integration. 9/10.

Domestic Models: Very weak. Xinghuo's output was cluttered with repetitive UI text (Terms of Service, etc.).

Conclusion

Results: by total score, Claude 3 (63) has indeed surpassed GPT-4 (48), while Gemini Ultra (49) is essentially on par with GPT-4. Domestic models still lag, though Wenxin Yiyan 4.0 performed best among them.
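The totals follow directly from the per-question marks reported in the sections above (Questions 1 through 8, including Gemini's +2 bonus on Question 3):

```python
# Per-question scores as reported above for the three overseas models
# (Gemini's Question 3 entry includes the +2 bonus for extra features).
scores = {
    "Claude 3":     [10, 4, 10, 10, 9, 9, 3, 8],
    "Gemini Ultra": [4, 3, 12, 5, 7, 6, 2, 10],
    "GPT-4":        [4, 2, 10, 0, 10, 10, 3, 9],
}

totals = {model: sum(marks) for model, marks in scores.items()}
print(totals)  # {'Claude 3': 63, 'Gemini Ultra': 49, 'GPT-4': 48}
```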

Final Scores

Final Thoughts:

  1. Gemini feels like AlphaZero—it has the raw talent for "self-growth" but is still maturing in knowledge.
  2. Claude 3 lacks integrated search, which is a major constraint in productivity environments.
  3. Domestic models should focus more on data quality than marketing.
  4. GPT-4's stability is declining, which raises questions about hardware or optimization issues.