
twister test case selection numbers don't make any sense #30100

Closed · andrewboie opened this issue Nov 18, 2020 · 4 comments · Fixed by #31999
Labels: area: Twister, bug, priority: low

Comments

@andrewboie (Contributor)

Let's take this as an example:

INFO    - Zephyr version: zephyr-v2.4.0-1654-g52dd8e60a1
INFO    - JOBS: 60
INFO    - Selecting default platforms per test case
INFO    - Building initial testcase list...
INFO    - 14202 test configurations selected, 298472 configurations discarded due to filters.
INFO    - Adding tasks to the queue...
INFO    - Total complete:  211/ 211  100%  skipped:   49, failed:    0
INFO    - 162 of 162 tests passed (100.00%), 0 failed, 14040 skipped with 0 warnings in 139.41 seconds
INFO    - In total 1634 test cases were executed on 307 out of total 307 platforms (100.00%)
INFO    - 0 tests executed on platforms, 162 tests were only built.

First, we are told that 14202 test configurations were selected, with 298472 discarded. These two numbers have no relationship to anything that follows.

We then get told:

  • 211 / 211 tests were run
  • 162 / 162 passed. What about the other 49 tests? Were they build only?
  • 14040 were skipped, but this doesn't seem to have any obvious relationship to the 14202 or 298472 numbers reported earlier
  • 1634 test cases were executed. I thought we did 211 tests?
  • 162 tests were only built. But I thought 162 tests were run in emulators and passed? What does this 162 number mean?

From a holistic point of view these numbers really don't make much sense. It would be better to report the results in a way where the relationships are clearer, so we have a clear understanding of what any given value means and what it might be a subset of.
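
A minimal sketch of the kind of explicit hierarchy that would make those relationships obvious (the names below are hypothetical, not Twister's actual code):

from dataclasses import dataclass

@dataclass
class SummaryCounts:
    # Hypothetical counters; a clearer report would state these buckets explicitly.
    configurations_discarded: int   # filtered out before any build
    configurations_selected: int    # testcase.yaml entry x platform pairs kept
    configurations_skipped: int     # dropped at runtime (e.g. missing toolchain)
    configurations_built_only: int  # built but never executed
    configurations_executed: int    # actually ran on an emulator or board
    cases_executed: int             # individual ztest cases inside executed configs

    def report(self) -> str:
        # Every selected configuration should land in exactly one bucket.
        assert self.configurations_selected == (
            self.configurations_skipped
            + self.configurations_built_only
            + self.configurations_executed
        )
        return (
            f"{self.configurations_selected} configurations selected "
            f"({self.configurations_discarded} discarded by filters): "
            f"{self.configurations_executed} executed, "
            f"{self.configurations_built_only} built only, "
            f"{self.configurations_skipped} skipped; "
            f"{self.cases_executed} test cases executed overall"
        )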

@andrewboie added the "Enhancement" and "area: Sanitycheck" labels on Nov 18, 2020
@nashif assigned PerMac and unassigned nashif on Nov 19, 2020
@nashif added the "bug" and "priority: medium" labels and removed the "Enhancement" label on Nov 19, 2020
@PerMac (Member) commented Nov 19, 2020

I will copy here a comment I made some time ago. It might answer some of your questions [tl;dr: the nomenclature used in reporting is confusing, and different names are used for the same thing]. Don't take it as a final answer; it is just how I saw this back then :). We will discuss it during the next testing WG meeting to see if we can agree on how to sort this out. There might be some additional counting issues as well, which I will verify, and I will create a PR fixing this. If you have use cases that produce wrong counts, it would be helpful to add them here so I can debug the faulty behavior more easily.

/home/maciej/zephyrproject2/zephyr/sc-venv/bin/python3.7 /home/maciej/zephyrproject2/zephyr/scripts/sanitycheck --build-only -T samples/hello_world/ --all --subset 2/120
Renaming output directory to /home/maciej/zephyrproject2/zephyr/sanity-out.1
INFO - Running only a subset: 2/120
INFO - JOBS: 8
INFO - Selecting all possible platforms per test case
INFO - Building initial testcase list...
INFO - 3 test configurations selected, 8 configurations discarded due to filters.
INFO - Adding tasks to the queue...
INFO - 2 of 2 tests passed (100.00%), 0 failed, 1 skipped with 0 warnings in 3.15 seconds
INFO - In total 2 test cases were executed on 269 out of total 272 platforms (98.90%)
INFO - 0 tests executed on platforms, 2 tests were only built.
INFO - Total complete: 2/ 2 100% skipped: 0, failed: 0

Process finished with exit code 0

I find the line "In total 2 test cases were executed on 269 out of total 272 platforms (98.90%)" confusing. It is not possible that only 2 test cases ran on 269 platforms. The issue is that 269 is the total number of preselected platforms (len(self.selected_platforms)), but we then chose only 3 of those platforms (and 1 is skipped later on).
I think we should be more descriptive and add information about this further narrowing of the platform set. The number also gives the wrong impression: it looks like we tested 98.9% of the platforms, when in fact the vast majority were simply skipped.

Another issue is how we count tests that were only built. In the line "In total 2 test cases were executed on 269 out of total 272 platforms (98.90%)", these 2 test cases were in fact only built. That is also inconsistent with the next line, "0 tests executed on platforms, 2 tests were only built", where build-only tests are subtracted.

The last issue is that we use different names for the same thing. "Test configurations" is in fact the same as "tests": both refer to len(suite.instances) in the code. IMO "test suites", in standard testing terminology, corresponds best to what we are counting there. I guess "test suite" is not used because we already have a TestSuite class in the code, which is something different again (it is a suite of all test suites).

So we have "test configurations" and "tests", both of which correspond to instances in the code and to test suites in testing terminology. "Test cases" is at least used consistently everywhere (though I am not sure the difference between a test and a test case is obvious) ;) Or did I miss something? I think sorting out this nomenclature would benefit anyone trying to work with sanitycheck.
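
To make the mapping concrete, here is a small illustrative sketch of how the terms relate (the names and numbers below are made up, not taken from the code):

# Illustrative only; the lists and counts below are made up.
# "test configuration" / "test": one testcase.yaml entry built for one platform
#                                (len(suite.instances) in the code)
# "test case":                   one ztest sub-test inside such a configuration
scenarios = ["kernel.threads.init"]            # entries from testcase.yaml
platforms = ["qemu_x86", "qemu_riscv32"]       # platforms left after filtering
cases_per_configuration = 4                    # ztest sub-tests in the source

configurations = [(s, p) for s in scenarios for p in platforms]
print(len(configurations))                            # 2 "tests"/"configurations"
print(len(configurations) * cases_per_configuration)  # 8 "test cases"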

@nashif (Member) commented Nov 19, 2020

Agreed, this is starting to get confusing and we are mixing old and new terminology. Some of the language can be changed here, and we can probably drop some of the detailed statistics that lead to confusion. In a nutshell, however, and historically, a test in sanitycheck parlance is:

a testcase.yaml entry executed on a platform

Let's dissect this with an example:

sanitycheck -T tests/kernel/threads/thread_init/ -v
Renaming output directory to /home/nashif/Work/zephyrproject/zephyr/sanity-out.2
INFO    - Zephyr version: zephyr-v2.4.0-1701-g414ba6b074
INFO    - JOBS: 20
INFO    - Selecting default platforms per test case
INFO    - Building initial testcase list...
INFO    - 24 test configurations selected, 284 configurations discarded due to filters.
INFO    - Adding tasks to the queue...
INFO    -  1/22 frdm_k64f                 tests/kernel/threads/thread_init/kernel.threads.init PASSED (build)
INFO    -  2/22 qemu_x86_64               tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 6.134s)
INFO    -  3/22 qemu_x86_64_nokpti        tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 6.137s)
INFO    -  4/22 qemu_nios2                tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 6.032s)
INFO    -  5/22 qemu_cortex_r5            tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 5.066s)
INFO    -  6/22 qemu_cortex_a53           tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 3.631s)
INFO    -  7/22 qemu_x86_coverage         tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.241s)
INFO    -  8/22 qemu_x86                  tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.187s)
INFO    -  9/22 qemu_x86_nopae            tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.171s)
INFO    - 10/22 qemu_x86_nokpti           tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.180s)
INFO    - 11/22 qemu_x86_tiny             tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.170s)
INFO    - 12/22 mps2_an521                tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.069s)
INFO    - 13/22 mps2_an385                tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.035s)
INFO    - 14/22 qemu_cortex_m0            tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.024s)
INFO    - 15/22 qemu_x86_nommu            tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.130s)
INFO    - 16/22 qemu_xtensa               tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.023s)
INFO    - 17/22 qemu_riscv64              tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.015s)
INFO    - 18/22 qemu_riscv32              tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.016s)
INFO    - 19/22 qemu_leon3                tests/kernel/threads/thread_init/kernel.threads.init PASSED (qemu 2.031s)
INFO    - 20/22 nsim_em7d_v22             tests/kernel/threads/thread_init/kernel.threads.init PASSED (nsim 1.471s)
INFO    - 21/22 native_posix              tests/kernel/threads/thread_init/kernel.threads.init PASSED (native 0.004s)
INFO    - 22/22 nsim_em                   tests/kernel/threads/thread_init/kernel.threads.init PASSED (nsim 1.441s)

INFO    - 22 of 22 tests passed (100.00%), 0 failed, 2 skipped with 0 warnings in 19.32 seconds
INFO    - In total 88 test cases were executed on 24 out of total 307 platforms (7.82%)
INFO    - 21 tests executed on platforms, 1 tests were only built.
  • This is one test folder with one test entry in testcase.yaml, which has 4 sub-tests (ztest tests).
  • There are 24 default platforms; we are calling sanitycheck with no platform filters, so we will attempt to execute on all 24.

24 test configurations selected, 284 configurations discarded due to filters.

  • 1 test folder (application) with 1 testcase.yaml entry on 24 platforms: 1 x 1 x 24 = 24 configurations.
  • The 284 covers all the platforms that are not default; those are discarded by pre-filtering.

22 of 22 tests passed (100.00%), 0 failed, 2 skipped with 0 warnings in 19.32 seconds

  • In the runtime filter we determine that we cannot run on 2 of them (those require a different toolchain), so 22 are executed and pass, and 2 are skipped.

In total 88 test cases were executed on 24 out of total 307 platforms (7.82%)

  • Every test has 4 sub-tests, so 22 x 4 = 88 test cases are executed on the 24 target platforms, out of 307 total platforms.

21 tests executed on platforms, 1 tests were only built.

  • Out of the 22, 21 actually ran, 1 was just built.
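
Putting the arithmetic from this example in one place (illustrative only; the variable names below are not Twister internals):

# Illustrative arithmetic only; the variable names are not Twister internals.
testcase_yaml_entries = 1    # thread_init has a single testcase.yaml entry
default_platforms = 24       # default platforms, no platform filter given
ztest_subtests = 4           # sub-tests inside the test source

selected = testcase_yaml_entries * default_platforms    # 24 configurations
skipped_at_runtime = 2                                   # wrong toolchain
passed = selected - skipped_at_runtime                   # 22
built_only = 1
executed_on_platforms = passed - built_only              # 21
test_cases = passed * ztest_subtests                     # 88

assert (selected, passed, executed_on_platforms, test_cases) == (24, 22, 21, 88)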

@PerMac (Member) commented Jan 4, 2021

I started a GitHub discussion to sort out this issue and define a consistent nomenclature: #31090.
Please have a look at the presentation there and let me know whether it is useful and what you think about the proposed naming convention.

@nashif added the "area: Twister" label and removed the "area: Sanitycheck" label on Jan 11, 2021
@carlescufi added the "priority: low" label and removed the "priority: medium" label on Jan 26, 2021
@carlescufi changed the title from "sanitycheck test case selection numbers don't make any sense" to "twister test case selection numbers don't make any sense" on Jan 26, 2021
@PerMac linked a pull request on Feb 4, 2021 that will close this issue
@PerMac (Member) commented Feb 4, 2021

@andrewboie Can you have a look at #31999? Does the output make more sense with it?
