automated testing

An abstract strategy board game for two players
flok
Posts: 12
Joined: Sat Jan 02, 2016 9:04 pm

automated testing

Post by flok » Fri Oct 29, 2021 10:26 pm

Hi,

I'm looking for software to do automated testing.
To be reasonably confident that a change improves the gameplay, I think you need to run hundreds, maybe thousands, of games.
So I would like software for that, with Elo estimation as well.
Does anyone know of something that runs on Linux?
www.ataxx.org / www.vanheusden.com

flok
Posts: 12
Joined: Sat Jan 02, 2016 9:04 pm

Re: automated testing

Post by flok » Sat Oct 30, 2021 2:36 pm

flok wrote:
Fri Oct 29, 2021 10:26 pm
Hi,

I'm looking for software to do automated testing.
To be reasonably confident that a change improves the gameplay, I think you need to run hundreds, maybe thousands, of games.
So I would like software for that, with Elo estimation as well.
Does anyone know of something that runs on Linux?
Well, by now I have written a simple Python script for this, but if something better is available, I'd like to know.

My program runs multiple Go "engines" in parallel and then (using gnugo) writes the final score (only the score) to a PGN result file. This file can then be processed with bayeselo or ordo.

E.g.:

Code:

   # PLAYER           :  RATING  POINTS  PLAYED   (%)
   1 GnuGO level 0  > :  5758.3  2460.0    2460   100
   2 Stop             :  4396.0  1644.0    2468    67
   3 Daffy Baduck     :  1606.5   800.0    2452    33
   4 Donald Baduck    :   897.5    14.0    2456     1
So on a Threadripper I can play 16 games in parallel.

Code:

Took 2098.560272s, cpu usage: 41426.270000s (19.740329)
Here it ran 12000 games, 16 at a time. It took almost 2100 seconds of wall-clock time, and on average almost 20 cores were busy at 100% (I ran with concurrency 16, but one of the programs is written in Java and the Java garbage collector often uses a whole core to do its thing).
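
A minimal sketch of the kind of runner described above, not the actual script. The names `pgn_record`, `run_matches` and the `play_game` callback are hypothetical; `play_game` stands in for whatever starts two engines, plays one game and returns the result string. It assumes the rating tool only needs the White, Black and Result tags, and that each game spends its time waiting on external engine processes, so a thread pool is enough.

```python
from concurrent.futures import ThreadPoolExecutor

VALID_RESULTS = ("1-0", "0-1", "1/2-1/2")


def pgn_record(white, black, result):
    """Format one result-only PGN game (no moves, just the tags and the score)."""
    if result not in VALID_RESULTS:
        raise ValueError(f"unexpected result: {result!r}")
    return (
        f'[White "{white}"]\n'
        f'[Black "{black}"]\n'
        f'[Result "{result}"]\n'
        f"\n{result}\n\n"
    )


def run_matches(pairings, play_game, concurrency=16):
    """Play every (white, black) pairing, at most `concurrency` at a time.

    `play_game(white, black)` is a hypothetical callback that must return
    one of VALID_RESULTS. Returns the concatenated PGN text.
    """
    def play_one(pair):
        white, black = pair
        return pgn_record(white, black, play_game(white, black))

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps the output in the same order as `pairings`.
        return "".join(pool.map(play_one, pairings))
```

The returned text can be written to a file and fed to bayeselo or ordo.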

Rémi Coulom
Posts: 145
Joined: Tue Feb 12, 2008 8:31 pm

Re: automated testing

Post by Rémi Coulom » Sat Oct 30, 2021 3:43 pm

I recommend reading the Chess Programming Wiki: https://www.chessprogramming.org/Match_Statistics

In particular, early stopping with SPRT is an important idea.

If more than two programs are playing, then a statistical model such as Elo is useful. But I don't like it much because the Elo model makes assumptions that are sometimes not very accurate. Using statistics based on match results between two programs is cleaner.
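
For readers unfamiliar with SPRT, here is a rough sketch of the basic idea for a two-program match. It is a simplified version under stated assumptions, not the exact method from the wiki page: draws are ignored (they are rare in Go), each game is treated as an independent win or loss, and the hypotheses `elo0`/`elo1` and error rates `alpha`/`beta` are illustrative defaults chosen here.

```python
import math


def expected_score(elo_diff):
    """Win probability implied by an Elo difference (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt(wins, losses, elo0=0.0, elo1=20.0, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on win/loss counts.

    H0: the new version is elo0 stronger than the old one.
    H1: the new version is elo1 stronger than the old one.
    Returns (decision, llr); decision is 'accept H0', 'accept H1' or 'continue'.
    """
    p0 = expected_score(elo0)
    p1 = expected_score(elo1)
    # Log-likelihood ratio of the observed results under H1 versus H0.
    llr = wins * math.log(p1 / p0) + losses * math.log((1.0 - p1) / (1.0 - p0))
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1", llr
    if llr <= lower:
        return "accept H0", llr
    return "continue", llr
```

The test is re-evaluated after every finished game, and the match stops as soon as the decision is no longer "continue". That is where the savings come from: a clearly better or clearly worse change is decided after far fewer games than a fixed-length match would need.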

flok
Posts: 12
Joined: Sat Jan 02, 2016 9:04 pm

Re: automated testing

Post by flok » Tue Nov 02, 2021 9:46 am

Rémi Coulom wrote:
Sat Oct 30, 2021 3:43 pm
I recommend reading the Chess Programming Wiki: https://www.chessprogramming.org/Match_Statistics

In particular, early stopping with SPRT is an important idea.

If more than two programs are playing, then a statistical model such as Elo is useful. But I don't like it much because the Elo model makes assumptions that are sometimes not very accurate. Using statistics based on match results between two programs is cleaner.
And that's what SPRT is doing? I'll look into it.

So far the Elo system I'm using looks promising, in the sense that a change that I think should help often shows up as an Elo increase. For me it is not important (yet) to know absolute ratings, only whether a change makes it play better.

Code:

Rank Name                                                   Elo    +    - games score oppo. draws
   1 GnuGO level 0                                         2269 -198  204 178903   99%   596    0%
   2 Pachi pattern                                         2072   -2   27  3303   84%   957    0%
   3 AmiGo                                                 1622    7    7 170054   81%   728    0%
   4 donaldbaduck-7be77f6                                  1348   12   12 10000   60%  1197    4%
   5 donaldbaduck-b13bcaa                                   892    7    7 23786   42%  1145    0%
   6 Stop                                                   867    5    5 179040   57%   865    0%
   7 donaldbaduck-134d100                                   805    9    9 16684   62%   700    0%
   8 donaldbaduck-31e424f                                   712    8    8 15402   39%  1036    0%
   9 donaldbaduck-22fd0f1                                   632   35   34   770   53%   725    1%
  10 donaldbaduck-0e65907                                   618    5    5 80328   50%   802    0%
  11 Daffy Baduck                                           333    4    4 178880   30%   966    0%
  12 donaldbaduck-dc92741                                    64    4    4 130310   19%   924    0%
  13 donaldbaduck-03db9de                                     6   13   13  7667   15%   800    1%
  14 donaldbaduck-18c867d                                   -47    8    8 20302   14%   811    0%
  15 Donald Baduck                                         -191    5    5 138601    5%  1048    0%
  16 /home/folkert/Projects/baduck/build/src/donaldbaduck  -195  235  428    44    0%  1267    0%
  17 donaldbaduck_ffcb9be                                  -220   12   12 11828    8%   811    0%
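
To illustrate why the number of games matters when only the relative strength of two versions is of interest, here is a small sketch that converts a head-to-head score into an Elo difference with a rough 95% interval. It is not what bayeselo or ordo compute; it assumes the logistic Elo model, ignores draws, and uses a plain normal approximation. The function names are hypothetical.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins, losses, z=1.96):
    """Return (low, estimate, high) Elo difference from win/loss counts.

    Uses a normal approximation for the score, so it needs at least one
    win and one loss and is only reasonable for larger samples.
    """
    if wins <= 0 or losses <= 0:
        raise ValueError("need at least one win and one loss")
    games = wins + losses
    score = wins / games
    margin = z * math.sqrt(score * (1.0 - score) / games)
    # Clamp so the bounds stay inside the domain of elo_from_score.
    low = max(score - margin, 1e-9)
    high = min(score + margin, 1.0 - 1e-9)
    return elo_from_score(low), elo_from_score(score), elo_from_score(high)
```

With a 55% score, 100 games give an interval that still includes zero, while 10000 games at the same score give one that is clearly positive.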

flok
Posts: 12
Joined: Sat Jan 02, 2016 9:04 pm

Re: automated testing

Post by flok » Tue Nov 02, 2021 3:00 pm

flok wrote:
Tue Nov 02, 2021 9:46 am

Code:

Rank Name                                                   Elo    +    - games score oppo. draws
   4 donaldbaduck-7be77f6                                  1348   12   12 10000   60%  1197    4%
   6 Stop                                                   867    5    5 179040   57%   865    0%
Yet on cgos:

Code:

—	Stop-0.9-005	        1061	247837	2021-11-02 12:51:22
1232142	donaldbaduck7be77f	542?	60	2021-11-02 12:39:19
oooooh hang on: different versions of 'stop' possibly

[...]

Yes, stop in the top is version 1.something.
Last edited by flok on Wed Nov 03, 2021 8:17 pm, edited 1 time in total.

flok
Posts: 12
Joined: Sat Jan 02, 2016 9:04 pm

Re: automated testing

Post by flok » Wed Nov 03, 2021 8:16 pm

Would it make sense to include self-play in the statistics?
My manager now plays n * (n - 1) pairings, where the -1 is there to prevent self versus self.
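
The pairing scheme can be sketched as follows. This is an illustration only, with a hypothetical `pairings` helper, not the manager's actual code. Ordered pairs are used, so every pair of engines meets once with each colour.

```python
from itertools import permutations, product


def pairings(engines, include_self_play=False):
    """Return all ordered (first, second) pairings of the given engines.

    Without self-play this yields n * (n - 1) pairings;
    with self-play it yields n * n.
    """
    if include_self_play:
        return list(product(engines, repeat=2))
    return list(permutations(engines, 2))
```

Including self-play only adds the n pairings on the diagonal.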

Rémi Coulom
Posts: 145
Joined: Tue Feb 12, 2008 8:31 pm

Re: automated testing

Post by Rémi Coulom » Wed Nov 03, 2021 9:55 pm

Self-play is better than nothing if you have no opponents close to your strength. I did it a lot with Crazy Stone.
