LabAI — Testing

Optimizations: 47 (59.6% success rate)
Total Spend: $1.7433 (443,874 tokens)
Avg Quality: 8.26 out of 10
Models Used: 15 (13 failed runs)

[Chart: Quality Trend (30 days)]

[Chart: Daily Cost (30 days)]

Model Performance

[Chart: Avg Quality Score]

[Chart: Avg Cost per Call ($)]

#	Model	Provider	Runs	Success	Failed	Avg Quality	Best	Worst	Avg Cost	Total Cost	Avg Tokens	Avg Latency
1	claude-opus-4.6	anthropic	8	4 (50%)	4	8.98	9.3	8.5	$0.009264	$0.0371	405	9,419ms
2	qwen3.6-plus	qwen	60	60 (100%)	0	8.73	9.5	6.2	$0.004469	$0.2681	1,538	34,858ms
3	claude-sonnet-4.6	anthropic	44	34 (77%)	10	8.67	9.5	6.5	$0.010701	$0.3638	801	20,579ms
4	minimax-m2.7	minimax	24	24 (100%)	0	8.60	9.4	7.5	$0.003838	$0.0921	2,031	37,134ms
5	gpt-oss-120b	openai	48	47 (98%)	1	8.60	9.5	7.5	$0.000393	$0.0185	1,264	19,155ms
6	claude-sonnet-4	anthropic	39	34 (87%)	5	8.43	9.5	6.0	$0.014988	$0.5096	1,087	23,860ms
7	mimo-v2-pro	xiaomi	56	56 (100%)	0	8.38	9.5	6.5	$0.003597	$0.2014	936	12,229ms
8	gemini-3.1-pro-preview	google	4	4 (100%)	0	8.25	9.5	7.0	$0.000000	$0.0000	980	20,091ms
9	gemini-3-flash-preview	google	4	4 (100%)	0	8.00	9.5	6.0	$0.000000	$0.0000	192	3,481ms
10	gemma-4-31b-it	google	8	7 (88%)	1	7.87	9.3	5.5	$0.007089	$0.0496	296	22,048ms
11	gemma-3-27b-it	google	135	132 (98%)	3	7.82	9.0	5.5	$0.000086	$0.0114	615	13,773ms
12	gpt-4o	openai	28	28 (100%)	0	7.58	8.5	7.0	$0.006523	$0.1826	716	7,912ms
13	hermes-4-70b	nousresearch	4	4 (100%)	0	7.50	8.0	7.0	$0.000000	$0.0000	298	3,837ms
14	deepseek-v3.2	deepseek	4	3 (75%)	1	7.23	8.2	6.5	$0.000430	$0.0013	732	18,639ms
15	gpt-4o-mini	openai	14	13 (93%)	1	-	-	-	$0.000595	$0.0077	1,070	13,824ms
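
The dashboard does not say whether Avg Cost is averaged over all runs or only over successful ones. The displayed figures look consistent with the second reading: Avg Cost multiplied by the Success count reproduces Total Cost to within rounding, while multiplying by Runs does not. The sketch below checks this for four rows copied from the table; the tolerance value and the helper name `matches` are assumptions made for illustration, not part of the dashboard.

```python
# (model, runs, successes, avg_cost, total_cost), copied from the table above
ROWS = [
    ("claude-opus-4.6", 8, 4, 0.009264, 0.0371),
    ("claude-sonnet-4.6", 44, 34, 0.010701, 0.3638),
    ("gpt-oss-120b", 48, 47, 0.000393, 0.0185),
    ("gemma-3-27b-it", 135, 132, 0.000086, 0.0114),
]

# Assumed allowance for rounding in the displayed values.
TOLERANCE = 0.0001


def matches(avg_cost: float, count: int, total_cost: float) -> bool:
    """True if avg_cost * count reproduces total_cost within TOLERANCE."""
    return abs(avg_cost * count - total_cost) < TOLERANCE


for model, runs, successes, avg_cost, total_cost in ROWS:
    by_success = matches(avg_cost, successes, total_cost)
    by_runs = matches(avg_cost, runs, total_cost)
    print(f"{model}: per-success={by_success}, per-run={by_runs}")
```

If this reading is right, failed runs contribute nothing to a model's Total Cost, which is worth keeping in mind when comparing models with very different failure counts.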

Prompt Generation Templates (5)

Template	Usage	Avg Quality	Avg Tokens	Avg Cost	Total Cost	Avg Latency
Default Prompt Engineering	3×	-	394	$0.004173	$0.0000	0ms
Constraint Explicit	2×	-	348	$0.000443	$0.0000	0ms
Contoh Pattern + Constraint Explicit	18×	-	1,738	$0.002900	$0.0015	6,611ms
Prompt Engineering Human-understandable	8×	-	357	$0.004672	$0.0000	0ms
Contoh Pattern	1×	-	699	$0.007329	$0.0000	0ms

Evaluation Templates (1)

Template	Usage	Avg Quality	Avg Tokens	Avg Cost	Total Cost	Avg Latency
Default Evaluation	25×	-	7,933	$0.026246	$0.0131	4,171ms

Prediction Template Accuracy Ranking

Ranked by average rank-match accuracy vs actual optimization scores

#	Template	Runs	Avg Accuracy	Best Run	Worst Run	Total Rank Match
🥇	Specificity and Outcome Predictability Judge	2	62.5%	100.0%	25.0%	5 / 8 correct
🥈	Clarity and Communicative Effectiveness Judge	4	56.3%	100.0%	25.0%	9 / 16 correct
🥉	Structural Completeness Judge	1	25.0%	25.0%	25.0%	1 / 4 correct
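
The ranking above is described as average rank-match accuracy against actual optimization scores, but the metric itself is not defined on the page. The following is only a sketch under one plausible reading: each run ranks four candidates (the totals 8, 16, and 4 all work out to four per run), and a position counts as correct when the predicted item at that rank equals the actual item at that rank. The function names and the sample rankings are hypothetical; only the correct/total counts come from the table.

```python
from typing import Sequence


def rank_match_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of rank positions where the predicted item equals the actual item."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("rankings must be non-empty and of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def pooled_accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correct rank positions pooled over all runs."""
    return 100.0 * correct / total


# Hypothetical run: only the first position agrees, so 1 of 4 is correct.
example = rank_match_accuracy(["a", "c", "d", "b"], ["a", "b", "c", "d"])

# Correct / total counts copied from the "Total Rank Match" column.
specificity = pooled_accuracy_pct(5, 8)
clarity = pooled_accuracy_pct(9, 16)
structural = pooled_accuracy_pct(1, 4)
```

Under this reading the pooled figures are 62.5, 56.25, and 25.0, which agree with the displayed 62.5%, 56.3%, and 25.0%. Because every run ranks the same number of items, the pooled value equals the mean of the per-run accuracies.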

Performance by Optimization Goal

Goal	Optimizations	Completed	Avg Quality	Total Cost	Avg Cost/Run
Balanced	41	26 (63%)	8.26	$1.4861	$0.003753
Quality maximal	6	2 (33%)	8.25	$0.2572	$0.004435
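
The per-goal rows can be cross-checked against the headline figures at the top of the page (47 optimizations, 59.6% success rate, $1.7433 total spend). The sketch below assumes that "Completed" is what the headline counts as a success; the dictionary layout and variable names are illustrative only.

```python
# Values copied from the "Performance by Optimization Goal" table.
goals = {
    "Balanced": {"optimizations": 41, "completed": 26, "total_cost": 1.4861},
    "Quality maximal": {"optimizations": 6, "completed": 2, "total_cost": 0.2572},
}

total_runs = sum(g["optimizations"] for g in goals.values())
total_completed = sum(g["completed"] for g in goals.values())
total_spend = round(sum(g["total_cost"] for g in goals.values()), 4)

# Overall success rate, assuming "completed" means "successful".
success_rate = round(100 * total_completed / total_runs, 1)

# Per-goal completion percentages, rounded as displayed in the table.
completion_pct = {
    name: round(100 * g["completed"] / g["optimizations"])
    for name, g in goals.items()
}

print(total_runs, total_completed, success_rate, total_spend, completion_pct)
```

With the table's values this gives 47 optimizations, 28 completed, a 59.6% success rate, and $1.7433 in total spend, matching the summary cards.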