Questions tagged [benchmarks]

For questions related to AI benchmarks: results that validate a specific technique or approach. Also for questions regarding the history of AI achievements and predictions of future achievements.

Distinct from "AI Milestones" in that milestones can refer to theories, where benchmarks refer to verified results.

https://en.wikipedia.org/wiki/Benchmarking

13 questions
22 votes, 4 answers

Why does ChatGPT fail in playing "20 questions"?

IBM Watson's success in playing "Jeopardy!" was a landmark in the history of artificial intelligence. In the seemingly simpler game of "Twenty questions" where player B has to guess a word that player A thinks of by asking questions to be answered…
10 votes, 3 answers

How can AI researchers avoid "overfitting" to commonly-used benchmarks as a community?

In fields such as Machine Learning, we typically (somewhat informally) say that we are overfitting if we improve our performance on a training set at the cost of reduced performance on a test set / the true population from which data is sampled. More…
Dennis Soemers • 9,894 • 2 • 25 • 66
6 votes, 1 answer

Interesting examples of discrete stochastic games

SGs are a generalization of MDPs to multiple agents. Like this previous question on MDPs, are there any interesting examples of zero-sum, discrete SGs—preferably with small state and action spaces? I'm hoping to use such examples as benchmarks, but…
6 votes, 1 answer

Benchmarks for reinforcement learning in discrete MDPs

To compare the performance of various algorithms for perfect information games, reasonable benchmarks include reversi and m,n,k-games (generalized tic-tac-toe). For imperfect information games, something like simplified poker is a reasonable…
6 votes, 2 answers

What are the most compact real-time strategy games?

There was a recent informal question on chat about RTS games suitable for AI benchmarks, and I thought it would be useful to ask a question about them in relation to AI research. Compact is defined as the fewest mechanics, elements, and smallest…
DukeZhou • 6,237 • 5 • 25 • 53
4 votes, 1 answer

Why is chess still a benchmark for Artificial Intelligence?

Even though modern chess-playing programs have shown themselves to be as strong as (or stronger than) even the best human players for nearly 20 years now (since 1997, when IBM's Deep Blue defeated the world chess champion Garry Kasparov), why would a…
DJ2 • 143 • 3
2 votes, 1 answer

Are there benchmarks for assessing the speed of the forward-pass of neural networks?

I have a task where I would like to use a convolutional neural network (CNN). I would like to incrementally start from the fastest models, fine-tune and see whether they fit my "budget". At the moment, I'm just looking at object detection CNN-based…
2 votes, 0 answers

NLP annotation tool online and other tools to compare performances of different NLP algorithms

I do text annotations (POS tagging, NER, chunking, synset) by using a specific annotation tool for Natural Language Processing. I would like to make the same annotations on different tools to compare the performances of both. Furthermore, for I…
franz1 • 163 • 4
1 vote, 1 answer

Is there a benchmark for multi-objective evolutionary algorithms?

I'm working on a project for an evolutionary algorithms course, and the problem we're trying to solve is multi-objective. We'll use NSGA-II, but we also wanted to compare it with some other MOEAs. However, we haven't been able to find good…
1 vote, 0 answers

Benchmark models for Text Classification / Sentiment Classification

I am currently working on a novel application in NLP where I try to classify empathic and non-empathic texts. I would like to compare the performance of my model to some benchmark models. As I am working with models based on Word2Vec embeddings, the…
1 vote, 0 answers

What is the efficiency of trained neural networks?

Training neural networks takes a while. My question is, how efficient is a neural network that is completely trained (assuming it's not a model that is constantly learning)? I understand that this is a vague and simply difficult question to answer,…
1 vote, 0 answers

Benchmarking SAC on PyBullet

So far I have seen TD3 and DDPG benchmarks on PyBullet environments, but I am also looking for SAC benchmarks on PyBullet. Can anyone help?
ASA • 151 • 1
0 votes, 0 answers

A technique to show what tokens are relatively predicted by an LLM

I’m picturing a technique where you can see what an LLM is likely to respond with, which updates in real time. It’s a bit trippy, but it’s like GitHub Copilot, in that there is predicted text while you type, but it’s predicting what an LLM would say…