Reviewing Artifacts

I have long felt uneasy about the way we review artifacts. Although I have personally reviewed many of them, I believe that, as a community, we are going about it the wrong way.

Weird Errors and Artifact Problems

The objective of artifact creation is straightforward: to make results reproducible by allowing the same code to be run again in the future. This is achieved by packaging all necessary components into an archive on a repository such as Zenodo, so that everything is self-contained and does not depend on an internet connection. On paper, this approach is quite appealing.
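To make the "self-contained" idea concrete, here is a minimal sketch of what such packaging could look like: bundle the directory containing the code, data, and scripts, together with a checksum manifest, into a single archive that can be uploaded to Zenodo and used later without any network access. The directory name and layout are hypothetical, not taken from any particular artifact, and real artifacts of course also need a VM image or equivalent to pin the toolchain.

```python
import hashlib
import json
import tarfile
from pathlib import Path

ARTIFACT_DIR = Path("artifact")      # hypothetical: code, data, and run scripts
OUTPUT = Path("artifact.tar.gz")
MANIFEST = Path("MANIFEST.json")


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def main() -> None:
    # Record a checksum for every file, so a future reviewer can check
    # that nothing was silently changed or lost.
    manifest = {
        str(p.relative_to(ARTIFACT_DIR)): sha256_of(p)
        for p in sorted(ARTIFACT_DIR.rglob("*"))
        if p.is_file()
    }
    MANIFEST.write_text(json.dumps(manifest, indent=2))

    # Bundle the directory plus the manifest into one archive that can be
    # uploaded to Zenodo (or any other host) and used entirely offline.
    with tarfile.open(OUTPUT, "w:gz") as tar:
        tar.add(ARTIFACT_DIR, arcname=ARTIFACT_DIR.name)
        tar.add(MANIFEST, arcname=f"{ARTIFACT_DIR.name}/{MANIFEST.name}")


if __name__ == "__main__":
    main()
```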

Yeah but not so much.

  1. Zenodo can delete artifacts without notifying users, as reported by Manuel Eberl.
  2. I encountered a behaviour difference depending on whether the directory was copied into the VM or shared with the host. Another reviewer and I hit an installation issue, but the artifact’s authors managed to track down the source of the error (impressive!). Although this was undoubtedly frustrating for them, I worry that similarly elusive issues will arise again, or go unnoticed in the opposite scenario (i.e., if the folder only works properly when shared).
  3. Another issue I ran into: a VM created with a VirtualBox version that was too recent (specifically, the latest release) is incompatible with the version shipped in the Ubuntu LTS distribution, because the settings file format differs. The two versions are probably only about two years apart. VirtualBox does not seem to prioritize compatibility with older versions, as this is not a stated goal of the project. On top of that, VirtualBox did not run at all on my home computer, because the CPU was too new (thanks to the Arch Linux wiki, I found the fix).
  4. It is unclear whether any artifact is actually useful. The idea is that, in the future, we will be able to reuse the binaries, but this is not obvious. For instance, if you want to run the code on a cluster, say on the entire SAT Competition benchmarks or even the entire SMT-LIB, VirtualBox is not an option on any cluster. You would therefore need to extract the binary, which assumes it is compatible with newer processors. Do you have an Apple M1 or a newer model? You might be lucky with Rosetta, but there is no guarantee. To make matters worse, running VirtualBox is currently broken on Apple computers. So if in 10 years we all run RISC-V CPUs…
  5. I currently do not know anyone who has actually used an artifact. Well-maintained software does not need one. Poorly maintained software… probably should not be used in the future anyway.

Reviewing is Broken

We publish artifacts with nice badges, like the ones from the ACM. Sounds great, right?

Yeah but not so much.

  1. It’s not truly possible to reject an artifact. The worst badge one can receive merely indicates that the artifact was uploaded, which says nothing about its content. I once tried to reject an artifact because it was really bad, but failed (no other reviewer had any comment to make after I clearly said I wanted to reject it).
  2. Reviewers pre-review the artifact, telling the authors what they need to fix; the authors then address the issues and the reviewers do a second pass. I once had to review an artifact that the authors had not even tested. I did test it during the pre-review, essentially doing the work for them. The authors then fixed the problems and, surprisingly, I had no issues with the artifact anymore (and it was accepted).
  3. Reviews are not what one might expect: in most cases, the reviewer knows little about the artifact’s topic, unlike for the paper review. At best, as a reviewer, I can comment on the scripts, but not on the actual code, since it is not my area of expertise. Moreover, I have to read the paper to understand what is claimed, often without really understanding the topic.
  4. Many conferences review the artifact in parallel with the paper, using different reviewers who do not communicate with each other. So even if a bug is found in the artifact, it is unlikely to affect the paper’s acceptance.
  5. Zenodo is not ideal for rejected papers. As a reviewer, I can see the various versions, which lets me guess where the artifact was previously submitted. A few years ago, only the final version of the artifact had to be on Zenodo or another platform; now, even the version under review must be uploaded.
  6. Few people seem willing to review artifacts. In fact, I believe that at least 30% of all pre-reviews are missing, and many artifacts are accepted with only two reviewers because the third never reviewed the artifact. To address this, artifact evaluation committees are turning to younger and younger PhD students (and yes, I am aware that at machine learning conferences it is normal for first-year PhD students to review, but not in our community). My impression is that review quality is not improving, the number of missing reviews keeps growing, and more and more late-stage PhD students no longer want to review at all.

What Should We Do?

I really do not know, but here is what I think is important:

  1. drop the “available” badge.
  2. past reviews of the artifact must be included in the submission. Great artifacts remain great.
  3. pre-reviewing should be able to lead to a rejection.
  4. reviewers of the paper should also review the artifact (but only if it has a chance to be accepted).
  5. tool papers should only be accepted together with an artifact, and only if both the paper and the artifact are good enough.

I am not convinced that these points are enough, but they would be a start.

For my part, I will continue to review artifacts, but there is one artifact evaluation committee I will never be part of again, after failing to reject that one artifact.