Hardware Testing and Benchmarking Methodology
On many hardware review websites, the explanation of the testing methodology is lacking in quality and the amount of data displayed is lacking in quantity. I'll attempt to show what needs to be addressed to improve these reviews, what you should find in a high-quality review, and how to spot misleading information.
Qualitative vs. Quantitative
When a reviewer states 'Product X is the fastest/quietest/best value,' this is a qualitative statement. The reviewer is making a claim that is unquantified; that is, it has not been put into numbers. This is bad because such a statement can't be tested or proven true, and it can therefore mislead the reader.
However, these statements can be quantified. The reviewer might say, 'Product X is the quietest because it measured 18 dB at 1 m at such and such speed, compared to product Y (see Table Z).' This may not be as flashy or as eloquent as a simple 'This is the best' statement, but with correct testing methodology to back up that sentence, it actually tells you something about the item, rather than showing off the reviewer's writing style.
Testing can be completely ruined unless the number of variables involved is reduced to the absolute minimum. For experiments, the ideal is to have only one variable. For example, voltage, ambient temperature, and airflow are held constant, while the temperature being measured is the only thing allowed to vary. In the real world, having only one variable is not always possible, but the good experimenter attempts to reduce the effects of anything that might affect the test. This might be accomplished by disconnecting or turning off any other attachments to a computer, or by removing unnecessary parts (such as a sound card or RAID array) while testing a video card. When variables aren't kept in check, accuracy may be reduced, although the perceived precision may be high (see below).
Accuracy vs. Precision
This probably brings back vague memories of high school physics or chemistry for most readers, so here is a quick refresher. If a result is accurate, it is close to the correct value; if a result is precise, it is repeatable and exact. Using a target as a metaphor, the bull's-eye is the 'correct' value, and the farther away from the center you are, the less 'correct' the value is.
This relates to hardware testing somewhat abstractly, but the general idea is that testing conditions can allow for a high degree of precision (21.345 degrees Celsius, for example) while the result is still completely wrong because not all variables are controlled (see above). On the flip side, individual results can be imprecise, yet with a large sample size the mean of the values obtained can still land close to the bull's-eye. Thus, precision does not have to be too great if the sample size is large enough. For hardware testers, this means that the most precise instrumentation available should be used, along with the greatest number of samples obtainable within a given time period. In an accurate hardware review, the data agree closely with the control values for the test and/or the hypothesized or expected results for the test.
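The difference between the two cases can be put into numbers. The following is only a minimal sketch: the 'true' temperature and both sets of readings are made-up values chosen for illustration, and summarize is a hypothetical helper, not part of any testing tool.

```python
from statistics import mean, stdev

# Hypothetical "bull's-eye": the true value, assumed known for illustration.
TRUE_TEMP_C = 40.0

# Made-up readings: tightly grouped but offset (precise, not accurate),
# e.g. a sensitive probe used while an uncontrolled variable shifts the result.
precise_but_biased = [42.31, 42.29, 42.30, 42.32, 42.28]

# Made-up readings: widely scattered but centred on the true value
# (each one imprecise, yet the mean is accurate).
scattered_but_centred = [38.5, 41.5, 39.0, 41.0, 40.5, 39.5, 40.0, 40.0]


def summarize(readings, true_value):
    """Return (bias, spread): mean offset from the true value, and sample std dev."""
    return mean(readings) - true_value, stdev(readings)


for name, data in [("precise but biased", precise_but_biased),
                   ("scattered but centred", scattered_but_centred)]:
    bias, spread = summarize(data, TRUE_TEMP_C)
    print(f"{name}: bias = {bias:+.2f} C, spread = {spread:.2f} C")
```

The first set has a tiny spread but sits more than two degrees away from the target, while the second set has a much larger spread but a mean that lands on it.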
The basic idea of significant digits is that you don't keep more digits than you measured. For example, when using a ruler, you write down a number which takes into account the smallest divisions on the stick (such as millimeters) and then estimate the next digit. Thus, for a ruler with millimeter graduations, a possible measurement might be 13.4 mm. This last-digit approximation isn't usually applicable to hardware testing, as most measuring is digital rather than analogue as with the ruler.
Also, when plugging raw numbers into equations or averaging data, the calculator will sometimes give more digits than are significant. The calculator may give the mean as 3.3333333..., but you only measured three significant digits in your data (3.32, 3.26, 3.45...), so the mean is correctly stated as 3.33.
When a reviewer doesn't keep the correct number of significant digits (most often by increasing the number of digits), it can give the impression that the data is more or less precise than it actually is.
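As a rough illustration of trimming a calculated mean back to the digits that were actually measured, here is a small sketch. The helper round_sig is a hypothetical name, and the last three readings in the list are invented so that the mean comes out near 3.3333; only the first three values echo the example above.

```python
from math import floor, log10
from statistics import mean


def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit: 0 for 3.33, 1 for 21.3, -3 for 0.0046
    exponent = floor(log10(abs(value)))
    return round(value, digits - 1 - exponent)


# Readings with three significant digits each (last three are invented).
readings = [3.32, 3.26, 3.45, 3.31, 3.28, 3.38]

raw_mean = mean(readings)              # many more digits than were measured
reported_mean = round_sig(raw_mean, 3)

print(raw_mean)
print(reported_mean)
```

The raw mean carries a long tail of digits that no instrument produced; the reported mean keeps only the three that the readings support.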
Repeatability of Results
The repeatability of a test is the most telling gauge of how well an experiment has been done. In hardware testing, if the reviewer and a third party follow the guidelines on this page and get the same results, the test results are repeatable. This goes along with accuracy and precision: as more people repeat a test and their results land near those of the other tests, the experiment is shown to be precise. If these new results are also near the control value 'bull's-eye', the experiment is also accurate.
The more times a test is done under the correct conditions, the less likely it is that spurious results (small mistakes in an individual test) adversely affect the final averages (mean, median, mode, etc.). Each instance of a product/item should be tested multiple times, and it is preferable to have more than one of the items to perform tests on, as some examples may perform better or worse than others.
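To see how repetition dilutes a single spurious result, consider the sketch below. The frame-rate numbers are invented for illustration, and the 45.0 entry stands in for one bad run (say, a background task that was left running).

```python
from statistics import mean, median

# Invented benchmark results in frames per second; 45.0 is the spurious run.
few_runs = [60.1, 45.0, 59.9]
many_runs = [60.1, 45.0, 59.9, 60.0, 60.2, 59.8, 60.1, 59.9, 60.0, 60.0]

for label, runs in [("3 runs", few_runs), ("10 runs", many_runs)]:
    print(f"{label}: mean = {mean(runs):.1f}, median = {median(runs):.1f}")
```

With three runs, the one bad result drags the mean down by about five frames per second; with ten runs, the same bad result costs about one and a half. The median barely moves in either case, which is one reason to report more than one kind of average.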
Standard Deviation/Standard Error
The standard deviation of a set of data is a measure of how spread out its values are. Standard error is the standard deviation divided by the square root of the number of samples; it is expressed in the same units as the data, although it is sometimes reported as a percentage of the mean. Thus, standard error takes into account the number of samples in the experiment (the more the better).
I'm oversimplifying, but a small standard deviation or standard error means that most of the values recorded during testing are within a small range of the arithmetic mean of the data. Basically, the smaller the standard deviation or standard error, the more the numbers can be trusted, and if standard error is reported in place of standard deviation, the reviewer is unashamed of the sample size of the testing.
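Both quantities are easy to compute from raw results. The following is a minimal sketch using invented temperature readings; standard_error is a hypothetical helper name, and the sample (n - 1) form of the standard deviation is assumed.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(samples):
    """Sample standard deviation divided by the square root of the sample count."""
    return stdev(samples) / sqrt(len(samples))


# Invented temperature readings in degrees Celsius.
runs = [38.5, 41.5, 39.0, 41.0, 40.5, 39.5, 40.0, 40.0]

sd = stdev(runs)
se = standard_error(runs)
relative_se_percent = 100 * se / mean(runs)  # standard error as a percentage of the mean

print(f"mean = {mean(runs):.2f} C")
print(f"standard deviation = {sd:.2f} C")
print(f"standard error = {se:.2f} C ({relative_se_percent:.2f}% of the mean)")

# Doubling the number of samples with the same spread shrinks the standard error.
print(f"standard error with twice the samples = {standard_error(runs * 2):.2f} C")
```

The standard deviation stays roughly the same when more samples with similar scatter are added, while the standard error shrinks, which is why a reviewer with a large sample size has reason to report it.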
During an individual run of testing, small errors (e.g. the tester forgets to remove a variable) are to be expected. With a number of tests and peer review (repeatability of results), these human errors can be reduced or eliminated.
Bias, a form of human error, is perhaps the biggest problem in hardware reviews. Many reviewers are given their review items for free and are lavished with praise and help from the companies whose products are being reviewed. Also, test items may be hand-picked by the company supplying them to be high performers, which distorts results. (This assumes that the reviewer is a third party to begin with, and not paid by the manufacturer of the product.) The result is that reviews may be unfairly weighted toward companies with more money to spend on willing reviewers and more products to give away to them.
Kevin C. Feb. 5th 2006 email@example.com