DeepSeek V4 vs Kimi K2.6 vs Qwen 3.6 vs GLM 5.1: Which AI Codes Real Projects?

Name: DeepSeek V4 vs Kimi K2.6 vs Qwen 3.6 vs GLM 5.1: Which AI Codes Real Projects?
Uploaded: 2026-05-07T13:55:21.000Z
Duration: 1 h 6 min 32 s

Introduction to the Experiment

Overview of the Coding Challenge

The video compares several AI models on coding, asking which of them produce deployable code and which produce code that should not be used in real projects.

Edward Grishin, with 12 years of experience in IT and founder of Futuro AI, outlines his methodology for evaluating these models using a starter kit containing 7,600 lines of code.

Evaluation Criteria

In the experiment, each model adapts the starter kit to a furniture niche, for creation and consulting use, without any extra prompts or manual adjustments. Each model's output is evaluated on eight criteria: security, tests, architecture, documentation, and four additional parameters. Each criterion is rated from 0 to 10, for a maximum score of 80 points.

Tools and Models Used

Description of Tools

The primary tool is Claude Code (Claude Opus), which the author uses as his main working instrument for real projects rather than for content creation. The focus is on applying Chinese models to practical business cases rather than to toy scenarios such as generating a website or a game.

Open-source tools are introduced as part of the evaluation process. They connect to the various models through an aggregator called OpenRouter, at a low cost compared to Western counterparts.

Models Under Review

Four key AI models are being tested:

DeepSeek V4 Pro: Features 600 billion parameters with context up to one million tokens.

Kimi K2.6: A native agent model with one trillion parameters.

Qwen 3.6: Developed by Alibaba; a dense model with fewer parameters.

GLM 5.1: With over 754 billion parameters and MIT licensing, it shows promising results against competitors like GPT-5.5 and Claude Opus.

Performance Comparison

Benchmarking Results

Initial benchmarks show that the Chinese models have largely caught up with GPT-5.5 at a much lower cost: about $1 per million input tokens, compared with $100 subscriptions for Western models such as GPT-5.5, or $20 for basic plans.

DeepSeek V4 scored higher than Claude and GPT on certain metrics despite its lower price. Qwen performed poorly because of integration issues, but it has an advantage in local deployment. GLM, by contrast, can operate autonomously for extended periods before needing intervention.
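The price comparison can be made concrete with a rough break-even calculation. This is only a sketch using the figures quoted above; the function name is illustrative, and it assumes input tokens are the only cost, which ignores output tokens and the usage limits of real subscriptions.

```python
def breakeven_tokens(subscription_usd: float, price_per_million_usd: float) -> float:
    """Input tokens that a pay-per-token price buys for one subscription fee."""
    return subscription_usd / price_per_million_usd * 1_000_000


# Figures quoted in the summary: $1 per million input tokens,
# versus $100 and $20 subscriptions.
for fee in (100, 20):
    tokens = breakeven_tokens(fee, 1.0)
    print(f"${fee} buys about {tokens / 1_000_000:.0f} million input tokens")
```

Under these assumptions, the pay-per-token route only becomes more expensive than a $100 subscription after roughly 100 million input tokens.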

Testing Phase

Model Outputs Review

Each model was run with the identical prompt under the same conditions. Each produced a project reachable through localhost links, which were then used to test functionality such as the generated admin panels and widgets.

Security Testing Attempts

Prompt injections were used to try to breach each project's defenses. Every attempt was blocked, which suggests that the code from all tested models holds up against this class of attack.

Code Quality Assessment

Self-Evaluation Process

Each model was asked to evaluate its own code against the eight criteria. The same code was then reviewed externally by GPT-5.5 and Opus 4.7. This two-layer review was meant to improve objectivity and to expose the strengths and weaknesses of each output.

Key Findings from Evaluations

GLM's Self-Evaluation:

Scored security at only 4 out of 10, mainly because of data handling issues, but noted strong modularity, largely thanks to the starter kit's structure.

GPT's Assessment:

Gave similar scores, flagging critical problems while acknowledging the architectural strengths of GLM's design.

Overall, self-assessments and external evaluations differed in places, but both consistently pointed to the same areas needing improvement across all models.

Security and Code Quality Assessment

Overview of Code Evaluation Metrics

The evaluation scores security (6), modularity (7), tests (6), documentation (7), handling (7), production readiness (6), code quality (7), and functionality (7), for a total of 53 points out of 80.

Critical issues are identified, particularly around SQL injection prevention and query parameterization. The reported overall score of 58 points appears slightly inflated, which raises concerns about the assessment's objectivity.
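The total follows directly from the rubric: eight criteria, each scored from 0 to 10. A minimal sketch of that arithmetic, using the scores listed above (the dictionary keys and function name are illustrative, not taken from the video):

```python
# Scores as reported in the summary; key names are illustrative.
SCORES = {
    "security": 6,
    "modularity": 7,
    "tests": 6,
    "documentation": 7,
    "handling": 7,
    "production_readiness": 6,
    "code_quality": 7,
    "functionality": 7,
}
MAX_PER_CRITERION = 10


def total_score(scores: dict[str, int]) -> int:
    """Sum the per-criterion scores after checking each is within 0..10."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} score {value} is outside 0..{MAX_PER_CRITERION}")
    return sum(scores.values())


total = total_score(SCORES)
maximum = MAX_PER_CRITERION * len(SCORES)
print(f"{total} / {maximum}")  # prints "53 / 80"
```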
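For readers unfamiliar with the issue, the difference between an interpolated and a parameterized query can be sketched as follows. This is not code from the reviewed projects; the `products` table and function names are hypothetical, and `sqlite3` is used only because it ships with Python.

```python
import sqlite3

# Hypothetical in-memory table standing in for the furniture catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO products (name) VALUES (?)",
    [("oak table",), ("pine chair",)],
)


def find_products_unsafe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Vulnerable: user input is spliced into the SQL text itself.
    query = f"SELECT name FROM products WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_products_safe(conn: sqlite3.Connection, name: str) -> list[str]:
    # Parameterized: the driver passes the value separately from the SQL text.
    query = "SELECT name FROM products WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


payload = "x' OR '1'='1"
print(len(find_products_unsafe(conn, payload)))  # every row leaks
print(len(find_products_safe(conn, payload)))    # no row matches the literal string
```

The unsafe version treats the payload as part of the query and returns the whole table, while the parameterized version compares against the payload as plain text.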

Detailed Findings on Security and Modularity

The report highlights significant problems in security and modularity, with three test failures noted. Documentation issues include missing channels in the README file.

Positive aspects include a strong anti-injection architecture; however, there are recommendations for improvement regarding documentation clarity and specific prompts.

Code Review Insights

Summary of Review Scores

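A review yields these scores: security (5), modularity (8), tests (8), documentation (3), handling (7), production readiness (6), plus code quality and functionality, for a total of 49 out of 80. The six stated scores sum to 37, so the two unstated ones together account for 12 points.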

Concerns arise about why the admin panel failed to open, suggesting potential issues with hardcoded secrets.

Critical Issues Identified

Opus rated this code at 52 points despite acknowledging critical problems such as secret management. This raises questions about the validity of the high scores Opus gives.

The author expresses disappointment with Opus's evaluation, which seems inconsistent with the vulnerabilities observed.

Comparative Analysis of Models

Performance Comparison Among AI Models

GLM 5.1 emerges as the preferred model for its speed, cost-effectiveness, and quality of assessment compared to models like GPT and Opus.

Despite GLM's lower implementation cost ($5 vs. $10+ for others), skepticism remains about the actual quality of the assessed code.

Future Directions in Code Reviews

GPT-5.5 shows promise as a code reviewer; however, doubts remain about its scoring accuracy given the recent evaluations.

The author invites viewer feedback on previous model comparisons, signaling continued interest in community engagement around these tools.

Conclusion and Community Engagement

Call to Action for Viewers

Viewers are encouraged to join the Telegram channel for more insight into live projects and to continue the discussion in the comments.