Hi,
I'm looking for something to do automated testing.
To be reasonably confident that a change improves the gameplay, I think you need to run hundreds, maybe thousands, of games.
So I would like software for that, preferably with Elo estimation.
Does anyone know something that runs on Linux?
automated testing
https://www.vanheusden.com/
https://github.com/folkertvanheusden/
Re: automated testing
flok wrote: ↑Fri Oct 29, 2021 10:26 pm
Hi,
I'm looking for something to do automated testing.
To have a good confidence that a change improves the gameplay, I think you need to run hundreds maybe thousands of games.
So I would like software for that. Also with elo estimation.
Does anyone know something that runs on Linux?
Well, by now I have written a simple Python script for this, but if something better is available, I'd like to know.
My program runs multiple Go engines in parallel and then (using gnugo) writes the final score, and only the score, to a PGN result file. This can then be processed with bayeselo or ordo.
E.g.:
Code: Select all
# PLAYER          : RATING  POINTS  PLAYED  (%)
1 GnuGO level 0 > : 5758.3  2460.0    2460  100
2 Stop            : 4396.0  1644.0    2468   67
3 Daffy Baduck    : 1606.5   800.0    2452   33
4 Donald Baduck   :  897.5    14.0    2456    1
Code: Select all
Took 2098.560272s, cpu usage: 41426.270000s (19.740329)
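For reference, a minimal sketch of the result-file step described above. It assumes the rating tool only needs the White, Black and Result tags of each game (as far as I know bayeselo and ordo both work from these, but check against your versions). The function names and the mapping of Go colours onto the PGN White/Black tags are illustrative assumptions, not the actual script.

```python
import io

# Result strings that PGN allows for a finished game.
VALID_RESULTS = {"1-0", "0-1", "1/2-1/2"}


def pgn_record(white, black, result):
    """Build one tag-only PGN record: no moves, just who played and the outcome."""
    if result not in VALID_RESULTS:
        raise ValueError(f"unknown result: {result!r}")
    return (
        f'[White "{white}"]\n'
        f'[Black "{black}"]\n'
        f'[Result "{result}"]\n'
        f"\n{result}\n\n"
    )


def write_results(games, out):
    """Append every (white, black, result) tuple to an open text stream."""
    for white, black, result in games:
        out.write(pgn_record(white, black, result))


# Usage with made-up results; a real run would open a file instead of StringIO.
buf = io.StringIO()
write_results(
    [("Stop", "Daffy Baduck", "1-0"), ("Donald Baduck", "Stop", "0-1")],
    buf,
)
print(buf.getvalue())
```

Engine names containing a double quote would need escaping before being written this way.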
Re: automated testing
I recommend reading the Chess Programming Wiki: https://www.chessprogramming.org/Match_Statistics
In particular, early stopping with SPRT is an important idea.
If more than two programs are playing, then a statistical model such as Elo is useful. But I don't like it much because the Elo model makes assumptions that are sometimes not very accurate. Using statistics based on match results between two programs is cleaner.
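To make the SPRT idea concrete, here is a rough sketch of Wald's sequential test for a two-engine match. It is a simplification under stated assumptions: draws are ignored (they are rare in Go), each game is treated as an independent win/loss, and the hypotheses are expressed as Elo differences via the usual logistic formula. The function names and the default values (`elo0`, `elo1`, `alpha`, `beta`) are illustrative choices, not taken from any particular tool.

```python
import math


def elo_to_score(elo):
    """Expected score of a player who is `elo` points stronger (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt(wins, losses, elo0=0.0, elo1=20.0, alpha=0.05, beta=0.05):
    """Wald SPRT on win/loss counts.

    H0: the new version is elo0 stronger; H1: it is elo1 stronger.
    Returns (decision, log-likelihood ratio).
    """
    p0 = elo_to_score(elo0)
    p1 = elo_to_score(elo1)
    # Log-likelihood ratio of H1 against H0 for independent Bernoulli games.
    llr = wins * math.log(p1 / p0) + losses * math.log((1.0 - p1) / (1.0 - p0))
    lower = math.log(beta / (1.0 - alpha))   # stop and accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # stop and accept H1 at or above this
    if llr >= upper:
        return "accept H1", llr
    if llr <= lower:
        return "accept H0", llr
    return "continue", llr


# Usage: call after every finished game and stop the match once a decision is reached.
print(sprt(wins=50, losses=50))
print(sprt(wins=600, losses=400))
```

The point of early stopping is that clearly good or clearly bad changes are decided after far fewer games than a fixed-length match would need.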
Re: automated testing
Rémi Coulom wrote: ↑Sat Oct 30, 2021 3:43 pm
I recommend reading the Chess Programming Wiki: https://www.chessprogramming.org/Match_Statistics
In particular, early stopping with SPRT is an important idea.
If more than two programs are playing, then a statistical model such as Elo is useful. But I don't like it much because the Elo model makes assumptions that are sometimes not very accurate. Using statistics based on match results between two programs is cleaner.
And that's what SPRT is doing? I'll look into it.
So far the Elo system I'm using looks promising, in the sense that a change that I think should help often shows up as an Elo increase. For me it is not (yet) important to know absolute ratings, only whether a change makes it play better.
Code: Select all
Rank Name                   Elo     +    -   games score oppo. draws
   1 GnuGO level 0         2269  -198  204  178903   99%   596    0%
   2 Pachi pattern         2072    -2   27    3303   84%   957    0%
   3 AmiGo                 1622     7    7  170054   81%   728    0%
   4 donaldbaduck-7be77f6  1348    12   12   10000   60%  1197    4%
   5 donaldbaduck-b13bcaa   892     7    7   23786   42%  1145    0%
   6 Stop                   867     5    5  179040   57%   865    0%
   7 donaldbaduck-134d100   805     9    9   16684   62%   700    0%
   8 donaldbaduck-31e424f   712     8    8   15402   39%  1036    0%
   9 donaldbaduck-22fd0f1   632    35   34     770   53%   725    1%
  10 donaldbaduck-0e65907   618     5    5   80328   50%   802    0%
  11 Daffy Baduck           333     4    4  178880   30%   966    0%
  12 donaldbaduck-dc92741    64     4    4  130310   19%   924    0%
  13 donaldbaduck-03db9de     6    13   13    7667   15%   800    1%
  14 donaldbaduck-18c867d   -47     8    8   20302   14%   811    0%
  15 Donald Baduck         -191     5    5  138601    5%  1048    0%
  16 /home/folkert/Projects/baduck/build/src/donaldbaduck  -195   235  428      44    0%  1267    0%
  17 donaldbaduck_ffcb9be  -220    12   12   11828    8%   811    0%
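If only the question "did this change help?" matters, a head-to-head match between the old and the new version can be turned into an Elo difference directly, without fitting a rating pool. The following is a sketch under assumptions: it uses the standard logistic Elo formula and a normal approximation for the error bar, and the function names and the example counts are made up for illustration.

```python
import math


def elo_diff(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins, losses, draws=0, z=1.96):
    """Return (low, estimate, high) Elo difference for one head-to-head match.

    Uses a normal approximation on the mean score, so it is only
    reasonable for a few hundred games or more.
    """
    n = wins + losses + draws
    if n == 0:
        raise ValueError("no games played")
    eps = 1e-9  # keeps the score away from exactly 0 or 1
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    se = math.sqrt(var / n)

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    return (
        elo_diff(clamp(score - z * se)),
        elo_diff(clamp(score)),
        elo_diff(clamp(score + z * se)),
    )


# Usage with made-up counts: new version wins 600 of 1000 games.
low, est, high = elo_interval(wins=600, losses=400)
print(f"{est:+.1f} Elo (95% interval {low:+.1f} to {high:+.1f})")
```

If the whole interval is above zero, the change is very likely an improvement against that particular opponent.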
Re: automated testing
flok wrote: ↑Tue Nov 02, 2021 9:46 am
Code: Select all
Rank Name                   Elo     +    -   games score oppo. draws
   4 donaldbaduck-7be77f6  1348    12   12   10000   60%  1197    4%
   6 Stop                   867     5    5  179040   57%   865    0%
Yet on cgos:
Code: Select all
— Stop-0.9-005 1061 247837 2021-11-02 12:51:22
1232142 donaldbaduck7be77f 542? 60 2021-11-02 12:39:19
[...]
Yes, the Stop in the top table is version 1.something.
Last edited by flok on Wed Nov 03, 2021 8:17 pm, edited 1 time in total.
Re: automated testing
Would it make sense to include self-play in the statistics?
My manager currently plays n * (n - 1) pairings, where the -1 is there to prevent self versus self.
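As an illustration of the n * (n - 1) schedule, here is a short sketch that assumes every engine has a unique name and that each ordered pair is played once (so every engine meets every other one with both colours). The function name is hypothetical.

```python
from itertools import permutations


def round_robin(engines):
    """All ordered (first, second) pairings without self-play: n * (n - 1) games."""
    if len(set(engines)) != len(engines):
        raise ValueError("engine names must be unique")
    return list(permutations(engines, 2))


# Usage with three of the engines from the tables above.
engines = ["Stop", "Daffy Baduck", "Donald Baduck"]
schedule = round_robin(engines)
print(len(schedule))
for first, second in schedule:
    print(first, "vs", second)
```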
Re: automated testing
Self-play is better than nothing if you have no opponents close to your strength. I did it a lot with Crazy Stone.