A complete list of the checks I make when comparing scorecards.
If you’ve had a look around the site you may have seen mention of various different “statuses”, but what do they actually mean? Well, let me tell you. There are 4 statuses that both a match and an individual comparison can have: Complete match, Approved match, Leeway match, and Mismatch. The status for a match is the worst status found among the comparisons for that match, so a match with an Approved match, 3 Complete match, and 2 Leeway match statuses would have a match status of Leeway match.
If a comparison finds that a particular piece of information matches across all scorecards it’s part of, then the system regards that as a Complete match. If this status occurs for individual comparisons I skip the comparison when generating the page for a match, as I’m only concerned with showing non-matching data.
When a comparison has a status of Approved match, this means that a discrepancy was found but that, after looking into it, I regard it as a reasonable difference. Examples include the number of batters in an innings where concussion substitutions (or X-player replacements in the Big Bash) take place, as there is no agreed method by which the substituted player is referred to, and occasions where a source links to the wrong player (where names are similar) but the details are otherwise correct.
When a comparison has a status of Leeway match, this means that while I couldn’t find a perfect match between the various scorecards for the data being checked, the worst discrepancies I found were down to acceptable differences in fields such as strike rate or economy rate. For example, different sites round strike rates differently, meaning that one site might report a strike rate as 123.13 and another site as 123.14. I allow a “leeway” of 0.01 for these fields, and treat any discrepancies that are within this leeway as Leeway match discrepancies.
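As a rough sketch (not the site's actual code), the leeway rule for a rate field could be written as below. The function name `compare_rate` and the choice to keep the published figures as strings are assumptions made for illustration; the 0.01 tolerance comes from the text.

```python
from decimal import Decimal

LEEWAY = Decimal("0.01")  # tolerance quoted in the text


def compare_rate(a: str, b: str) -> str:
    """Classify two published rates (kept as strings) against the leeway."""
    # Decimal keeps the figures exactly as printed on the scorecard.
    diff = abs(Decimal(a) - Decimal(b))
    if diff == 0:
        return "Complete match"
    if diff <= LEEWAY:
        return "Leeway match"
    return "Mismatch"


print(compare_rate("123.13", "123.14"))
```

`Decimal` is used rather than `float` because a binary floating-point subtraction of 123.14 and 123.13 can land a hair either side of 0.01, which would make a boundary case like this one unreliable.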
Mismatch is recorded when I find a non-acceptable discrepancy between the data from the scorecards for a match. An example might be scorecards showing a different number of runs scored by a batter, or different types of extras for an innings.
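The rule that a match takes the worst status among its comparisons can be sketched as follows. This is an illustration only: the `Status` enum and `match_status` function are hypothetical names, and the severity order (best to worst) is assumed from the order in which the four statuses are listed above.

```python
from enum import IntEnum


class Status(IntEnum):
    # Assumed severity order, best to worst, following the order
    # in which the statuses are listed in the text.
    COMPLETE_MATCH = 0
    APPROVED_MATCH = 1
    LEEWAY_MATCH = 2
    MISMATCH = 3


def match_status(comparison_statuses):
    """Return the worst status among a match's individual comparisons."""
    # With no comparisons at all, nothing disagrees.
    return max(comparison_statuses, default=Status.COMPLETE_MATCH)


statuses = (
    [Status.APPROVED_MATCH]
    + [Status.COMPLETE_MATCH] * 3
    + [Status.LEEWAY_MATCH] * 2
)
print(match_status(statuses).name)
```

Using an integer-backed enum means "worst" is simply `max`, and a single Mismatch anywhere is enough to mark the whole match as a Mismatch.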
Match information checks
This check ensures that the result listed on each scorecard actually matches. I do some normalisation of the result to make it easier to compare, for example by tweaking the names and using the single form D/L method for the various versions of Duckworth-Lewis(-Stern).
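A minimal sketch of the Duckworth-Lewis part of that normalisation might look like this. The spellings matched by the pattern are guesses at what sources might publish, not a list taken from the actual sites, and `normalise_result` is a hypothetical name.

```python
import re

# Hypothetical spellings; real sources may use others.
_DL_PATTERN = re.compile(
    r"\b(?:duckworth[- ]lewis(?:[- ]stern)?|d/l|dls)(?:\s+method)?\b",
    re.IGNORECASE,
)


def normalise_result(result: str) -> str:
    """Collapse whitespace and map every D/L variant to 'D/L method'."""
    text = " ".join(result.split())
    return _DL_PATTERN.sub("D/L method", text)


print(normalise_result("India won by 5 runs (DLS method)"))
```

Once every source's result string has been passed through the same function, a plain equality test is enough for the comparison itself.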
This check ensures that the teams listed on each scorecard actually match.
As well as checking that the data already mentioned is correct for the overall scorecard, I also perform a number of checks on each innings within a scorecard.
Number of innings
Ensures that each scorecard for a match covers the same number of innings. In order to do this I deliberately skip any super overs that are listed within scorecards (as CricketArchive sometimes do), as other sites don’t currently include super overs in their scorecards. I’d prefer to see super overs included in scorecards by default, but I’m not yet the arbiter of all that is right with regards to scorecards.
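Skipping super overs before counting could be sketched like this. The data layout (a dict of source name to a list of innings dicts) and the `super_over` flag are assumptions for illustration, not the real structure.

```python
def regular_innings(innings):
    """Drop super overs so every source is counted on the same basis."""
    return [inn for inn in innings if not inn.get("super_over", False)]


def same_innings_count(scorecards):
    """True when every scorecard covers the same number of regular innings."""
    counts = {len(regular_innings(card)) for card in scorecards.values()}
    return len(counts) <= 1


cards = {
    # Hypothetical source that lists both super overs as extra innings.
    "source_a": [{"team": "X"}, {"team": "Y"},
                 {"team": "Y", "super_over": True},
                 {"team": "X", "super_over": True}],
    "source_b": [{"team": "X"}, {"team": "Y"}],
}
print(same_innings_count(cards))
```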
The batting team for this particular innings is checked to ensure that it is the same in all of the scorecards.
Checks that the extras listed for the innings match across all of the scorecards. This check requires that all of the different types of extras match, not just the totals.
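To illustrate why comparing totals alone is not enough, here is a small sketch in which two sources agree on the total but not on the breakdown. The field names in `EXTRA_TYPES` are assumptions made for the example.

```python
# Assumed field names for the breakdown of extras.
EXTRA_TYPES = ("byes", "leg_byes", "wides", "no_balls", "penalties")


def extras_match(extras_by_source):
    """True only when every type of extra agrees across all sources."""
    breakdowns = {
        tuple(extras.get(kind, 0) for kind in EXTRA_TYPES)
        for extras in extras_by_source.values()
    }
    return len(breakdowns) == 1


source_a = {"byes": 1, "leg_byes": 2, "wides": 2}
source_b = {"byes": 2, "leg_byes": 1, "wides": 2}  # same total of 5
print(extras_match({"a": source_a, "b": source_b}))
```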
The total runs scored, the number of wickets that fell, and whether the innings was declared (for multi-day matches) are compared across the scorecards to ensure that each field matches.
Fall of wicket checks
Number of fallen wickets
Checks that the number of wickets listed in the “fall of wickets” section of the innings matches across sites. I actually do some extra work for this check, due to the different ways sites deal with batters who retire hurt. The ICC include such batters in the “fall of wickets” section, but with the same wicket number as the previous entry; ESPNcricinfo include the same batter but indicate the reason; and CricketArchive don’t include the retired batter at all. I exclude the retired hurt batters from the ICC and ESPNcricinfo lists to make comparison easier.
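One way the retired hurt entries might be filtered out is sketched below. The entry layout and the `retired_hurt` flag are hypothetical; the repeated-wicket-number test is based on the description of the ICC format above.

```python
def drop_retired_hurt(entries):
    """Remove retired hurt batters from a fall-of-wickets list."""
    kept = []
    for entry in entries:
        # ICC style: the retired batter repeats the previous wicket number.
        repeats_previous = bool(kept) and entry.get("wicket") == kept[-1].get("wicket")
        # ESPNcricinfo style: the entry carries an explicit reason.
        if entry.get("retired_hurt") or repeats_previous:
            continue
        kept.append(entry)
    return kept


icc_style = [
    {"wicket": 1, "score": 10, "batter": "A"},
    {"wicket": 1, "score": 25, "batter": "B"},  # retired hurt
    {"wicket": 2, "score": 40, "batter": "C"},
]
print([e["batter"] for e in drop_retired_hurt(icc_style)])
```

This sketch does not handle a batter retiring hurt before the first wicket has fallen, which would need its own rule.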
Details for each fallen wicket
As well as checking the number of fallen wickets in an innings, I also check the details of each individual fallen wicket entry. This involves checking the wicket number and the score at the time of the wicket, as well as the delivery on which the wicket occurred (if available). For CricketArchive, the delivery for the fall of a wicket is generally recorded in the form 11.5 (or similar); however, if the final wicket of an innings falls on the last ball of the final over, the delivery is recorded differently, with the over number alone being used. I adjust for this in order to make comparisons possible.
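The adjustment for an over-only delivery could look something like the sketch below. It assumes six-ball overs and assumes that the other sources would write the same delivery as the last ball of the preceding over (for example 19.6 rather than 20); neither detail is confirmed by the text, and `normalise_delivery` is a hypothetical name.

```python
def normalise_delivery(delivery: str, balls_per_over: int = 6) -> str:
    """Rewrite an over-only delivery such as '20' in over.ball form."""
    if "." in delivery:
        return delivery  # already in over.ball form, e.g. '11.5'
    completed_overs = int(delivery)
    # The last ball of over N is ball `balls_per_over` of over N - 1
    # when overs are counted from zero.
    return f"{completed_overs - 1}.{balls_per_over}"


print(normalise_delivery("20"))
print(normalise_delivery("11.5"))
```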
Number of batters
The simplest check I do with regards to batters in an innings is to check that the various scorecards have the same number of batters listed for the innings. This doesn’t include any players who didn’t bat simply because they didn’t get in or were absent hurt.
Number of players who did not bat
This check compares the scorecards to ensure that the number of players listed as not having batted in the innings is the same. Absent hurt batters are not included in this. Surprisingly this can be a source of discrepancies between sources, as there isn’t yet an agreed standard for including or excluding players who come into a team as concussion substitutes: some sites include them and some do not. For that reason, if there is a discrepancy due to that difference in approach (but only for that reason), I don’t regard it as an actual error.
Number of players who were absent hurt
The last of the checks based on batter numbers is for the number of absent hurt players in an innings.
Details for each batter
One of the more complicated checks takes place when I validate each batter. Right now I check any batting figures that appear on 2 or more scorecards for the batter, as well as the dismissal method (if the batter was out). Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the strike rate, as different sites use different methods of rounding.
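The idea of only comparing figures that at least two scorecards report can be sketched as follows. The field names, the string-valued figures, and the `compare_batter` function are all illustrative assumptions rather than the real implementation.

```python
from decimal import Decimal

# Fields allowed a tolerance; everything else must match exactly.
LEEWAY = {"strike_rate": Decimal("0.01")}


def compare_batter(figures_by_source):
    """Compare each figure that appears on two or more scorecards."""
    results = {}
    all_fields = sorted({f for figs in figures_by_source.values() for f in figs})
    for field in all_fields:
        values = [figs[field] for figs in figures_by_source.values() if field in figs]
        if len(values) < 2:
            continue  # only one source reports it, nothing to compare
        if len(set(values)) == 1:
            results[field] = "Complete match"
        elif field in LEEWAY:
            numbers = [Decimal(v) for v in values]
            within = max(numbers) - min(numbers) <= LEEWAY[field]
            results[field] = "Leeway match" if within else "Mismatch"
        else:
            results[field] = "Mismatch"
    return results


figures = {
    "source_a": {"runs": "45", "balls": "30", "strike_rate": "150.00"},
    "source_b": {"runs": "45", "balls": "30", "strike_rate": "150.01", "minutes": "52"},
    "source_c": {"runs": "46"},
}
print(compare_batter(figures))
```

Here `minutes` is skipped because only one source lists it, while `runs` is a Mismatch because the third source disagrees.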
The actual person who batted is also checked to ensure that the scorecards don’t have different players even if the other details match. Doing this alone caused me to start another project to allow people to be easily mapped across sites, and some day soon I’ll release the details of that.
Number of bowlers
As with batters, the simplest check I do with regards to bowlers in an innings is to check that the various scorecards have the same number of bowlers listed for the innings.
Details for each bowler
As with batters, one of the more complicated checks takes place when I validate each bowler. Right now I check any bowling figures that appear on 2 or more scorecards for the bowler. Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the economy rate, as different sites use different methods of rounding.
The actual person who bowled is also checked to ensure that the scorecards don’t have different players even if the other details match. This follows the same procedure used for the batters.