A complete list of the checks I make when comparing scorecards.
If you’ve had a look around the site you may have seen mention of various “statuses”, but what do they actually mean? Well, let me tell you. There are 3 statuses that both a match and an individual comparison can have: Complete match, Leeway match, and Mismatch. The status for a match is the worst status found among the comparisons for that match, so a match with 3 Complete match and 2 Leeway match comparisons would have a match status of Leeway match.
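As a rough illustration of the “worst status wins” rule (a sketch only, not the site’s actual code), the three statuses can be given an order and the match status taken as the maximum. The `Status` enum and the `match_status` function are hypothetical names, and treating a match with no comparisons as a Complete match is an assumption.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical ordering of the three statuses, from best to worst."""
    COMPLETE_MATCH = 0
    LEEWAY_MATCH = 1
    MISMATCH = 2


def match_status(comparison_statuses):
    """Return the worst status among the individual comparisons.

    Assumption: a match with no comparisons counts as a Complete match.
    """
    return max(comparison_statuses, default=Status.COMPLETE_MATCH)


statuses = [Status.COMPLETE_MATCH] * 3 + [Status.LEEWAY_MATCH] * 2
print(match_status(statuses).name)  # LEEWAY_MATCH
```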
If a comparison finds that a particular piece of information matches across all scorecards it’s part of then the system regards that as a Complete match. If this status occurs for individual comparisons I skip the comparison when generating the page for a match, as I’m only concerned with showing non-matching data.
When a comparison has a status of Leeway match this means that, while I couldn’t find a perfect match between the various scorecards for the data being checked, the worst discrepancies I found were down to acceptable differences in fields such as strike rate or economy rate. For example, different sites round strike rates differently, meaning that one site might report a strike rate as 123.13 and another as 123.14. I allow a “leeway” of 0.01 for these fields, and treat any discrepancies that fall within this leeway as Leeway match discrepancies.
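To make the leeway rule concrete, here is a minimal sketch of how a rate field could be classified. The function name `classify_rates` is hypothetical, and comparing the largest and smallest reported value is an assumption about how the leeway is applied when more than two sites report the figure.

```python
from decimal import Decimal

# The 0.01 leeway described above.
LEEWAY = Decimal("0.01")


def classify_rates(values):
    """Classify one rate field as reported by several sites.

    `values` are the figures as strings, e.g. ["123.13", "123.14"].
    """
    decimals = [Decimal(value) for value in values]
    spread = max(decimals) - min(decimals)
    if spread == 0:
        return "Complete match"
    if spread <= LEEWAY:
        return "Leeway match"
    return "Mismatch"


print(classify_rates(["123.13", "123.14"]))  # Leeway match
```

The sketch uses `Decimal` rather than `float` because `123.14 - 123.13` as binary floating point comes out slightly larger than `0.01`, which would wrongly turn this example into a Mismatch.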
Mismatch is recorded when I find a non-acceptable discrepancy between the data from the scorecards for a match. An example might be scorecards showing a different number of runs scored by a batter, or different types of extras for an innings.
Match information checks
This check ensures that the result listed on each scorecard actually matches. I do some normalisation of the result to make it easier to compare, for example by tweaking the names, and using the single form D/L method for the various versions of Duckworth-Lewis(-Stern).
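The D/L part of that normalisation might look something like the sketch below. The spellings it recognises (“DLS”, “D/L”, “Duckworth-Lewis”, “Duckworth-Lewis-Stern”, with or without “method”) are assumptions about what the sites print, `normalise_result` is a hypothetical name, and the name tweaking mentioned above is not covered.

```python
import re

# Assumed spellings of the method; real scorecards may use others.
_DL_PATTERN = re.compile(
    r"\b(?:duckworth[-/ ]lewis(?:[-/ ]stern)?|dls|d/l)(?!\w)(?:\s+method)?",
    re.IGNORECASE,
)


def normalise_result(result):
    """Collapse whitespace and rewrite every D/L variant to 'D/L method'."""
    text = " ".join(result.split())
    return _DL_PATTERN.sub("D/L method", text)


print(normalise_result("India won by 5 runs (DLS method)"))
# India won by 5 runs (D/L method)
```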
This check ensures that the teams listed on each scorecard actually match.
As well as checking that the data already mentioned is correct for the overall scorecard, I also perform a number of checks on each innings within a scorecard.
Number of innings
Ensures that each scorecard for a match covers the same number of innings. In order to do this I deliberately skip any super overs that are listed within scorecards (as CricketArchive sometimes do), as other sites don’t currently include super overs in their scorecards. I’d prefer to see super overs included in scorecards by default, but I’m not yet the arbiter of all that is right with regards to scorecards.
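A minimal sketch of the innings count with super overs left out, assuming each innings is a dictionary carrying a hypothetical `super_over` flag (the real data will be shaped differently):

```python
def innings_count(innings):
    """Count the innings on one scorecard, ignoring super overs."""
    return sum(1 for inn in innings if not inn.get("super_over", False))


def same_number_of_innings(scorecards):
    """True when every site's scorecard covers the same number of innings."""
    counts = {innings_count(innings) for innings in scorecards.values()}
    return len(counts) <= 1


scorecards = {
    "ICC": [{"number": 1}, {"number": 2}],
    "CricketArchive": [
        {"number": 1},
        {"number": 2},
        {"number": 3, "super_over": True},
        {"number": 4, "super_over": True},
    ],
}
print(same_number_of_innings(scorecards))  # True
```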
The batting team for this particular innings is checked to ensure that it is the same in all of the scorecards.
Checks that the extras listed for the innings match across all of the scorecards. This check requires that all of the different types of extras match, not just the totals.
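To show why comparing totals alone isn’t enough, here is a small sketch that compares each type of extra separately. The field names in `EXTRA_TYPES` and the function name are illustrative assumptions, as is treating a missing type as zero.

```python
# Assumed field names for the types of extras.
EXTRA_TYPES = ("byes", "leg_byes", "wides", "no_balls", "penalties")


def extras_discrepancies(extras_by_site):
    """Return the types of extras whose values differ between sites."""
    differences = {}
    for kind in EXTRA_TYPES:
        # Assumption: a type missing from a scorecard counts as 0.
        values = {site: extras.get(kind, 0) for site, extras in extras_by_site.items()}
        if len(set(values.values())) > 1:
            differences[kind] = values
    return differences


extras = {
    "site_a": {"byes": 1, "leg_byes": 2, "wides": 3, "no_balls": 1},
    "site_b": {"byes": 1, "leg_byes": 2, "wides": 2, "no_balls": 2},
}
# Both sites total 7 extras, yet two types disagree.
print(sorted(extras_discrepancies(extras)))  # ['no_balls', 'wides']
```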
The total runs scored, the number of wickets that fell, and whether the innings was declared (for multi-day matches) are compared across the scorecards to ensure that each field matches.
Fall of wicket checks
Number of fallen wickets
Checks that the number of wickets listed in the “fall of wickets” section of the innings matches across sites. I actually do some extra work for this check, due to the different ways sites deal with batters who retire hurt. The ICC include such batters in the “fall of wickets” section but give them the same wicket number as the previous entry; ESPNcricinfo include the same batters but indicate the reason; CricketArchive don’t include retired batters at all. I exclude the retired hurt batters from the ICC and ESPNcricinfo lists to make comparison easier.
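The per-site clean-up could be sketched as follows. The entry shape (`wicket`, `batter`, `reason` keys) and the function names are hypothetical, and the ICC rule shown here would not catch a batter who retires hurt before the first wicket falls.

```python
def drop_repeated_wicket_numbers(entries):
    """ICC style: a retired batter repeats the previous wicket number."""
    kept = []
    for entry in entries:
        if kept and entry["wicket"] == kept[-1]["wicket"]:
            continue  # same wicket number as the last kept entry: retired
        kept.append(entry)
    return kept


def drop_flagged_retirements(entries):
    """ESPNcricinfo style: the retirement is indicated by a reason."""
    return [entry for entry in entries if entry.get("reason") != "retired hurt"]


# Sites not listed here (e.g. CricketArchive) are used as-is.
NORMALISERS = {
    "ICC": drop_repeated_wicket_numbers,
    "ESPNcricinfo": drop_flagged_retirements,
}


def fallen_wicket_counts(fow_by_site):
    """Number of fall-of-wicket entries per site after normalisation."""
    return {
        site: len(NORMALISERS.get(site, lambda e: e)(entries))
        for site, entries in fow_by_site.items()
    }


fow = {
    "ICC": [{"wicket": 1, "batter": "A"}, {"wicket": 1, "batter": "B"}, {"wicket": 2, "batter": "C"}],
    "ESPNcricinfo": [{"wicket": 1, "batter": "A"}, {"batter": "B", "reason": "retired hurt"}, {"wicket": 2, "batter": "C"}],
    "CricketArchive": [{"wicket": 1, "batter": "A"}, {"wicket": 2, "batter": "C"}],
}
print(fallen_wicket_counts(fow))
# {'ICC': 2, 'ESPNcricinfo': 2, 'CricketArchive': 2}
```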
Details for each fallen wicket
As well as checking for the number of fallen wickets in an innings I also check the details of each individual fallen wicket entry. This involves checking the wicket number, and score at the time of the wicket, as well as the delivery on which the wicket occurred (if available).
Number of batters
The simplest check I do with regards to batters in an innings is to check that the various scorecards have the same number of batters listed for the innings. This doesn’t include any players who didn’t bat simply because they didn’t get in or were absent hurt.
Number of players who did not bat
This check compares the scorecards to ensure that the number of players listed as not having batted in the innings is the same. Absent hurt batters are not included in this. Perhaps surprisingly this can be a source of discrepancies between sources, as there doesn’t yet seem to be an agreed standard for including or excluding players who come into a team as concussion substitutes, with some sites including them and some not.
Number of players who were absent hurt
The last batter number-based check is one for the number of absent hurt players in an innings.
Details for each batter
One of the more complicated checks takes place when I validate each batter. Right now I check any batting figures that appear on 2 or more scorecards for the batter, as well as the dismissal method (if the batter was out). Most figures must match exactly; however, I do allow a slight leeway (of 0.01 runs) for the strike rate, as different sites use different methods of rounding.
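Putting the “2 or more scorecards” rule and the strike rate leeway together, a batter comparison might be sketched like this. The field names, the `compare_batter` function, and the use of strings for strike rates are all assumptions made for illustration.

```python
from decimal import Decimal

LEEWAY = Decimal("0.01")
LEEWAY_FIELDS = {"strike_rate"}  # assumed name for the strike rate field


def _within_leeway(values):
    decimals = [Decimal(value) for value in values]
    return max(decimals) - min(decimals) <= LEEWAY


def compare_batter(figures_by_site):
    """Compare every batting figure that appears on 2 or more scorecards."""
    fields = set()
    for figures in figures_by_site.values():
        fields.update(figures)

    results = {}
    for field in sorted(fields):
        values = [f[field] for f in figures_by_site.values() if f.get(field) is not None]
        if len(values) < 2:
            continue  # only one site reports this figure, nothing to compare
        if len(set(values)) == 1:
            results[field] = "Complete match"
        elif field in LEEWAY_FIELDS and _within_leeway(values):
            results[field] = "Leeway match"
        else:
            results[field] = "Mismatch"
    return results


figures = {
    "ICC": {"runs": 20, "balls": 17, "strike_rate": "117.64"},
    "ESPNcricinfo": {"runs": 20, "balls": 17, "strike_rate": "117.65", "minutes": 25},
    "CricketArchive": {"runs": 21, "balls": 17},
}
print(compare_batter(figures))
# {'balls': 'Complete match', 'runs': 'Mismatch', 'strike_rate': 'Leeway match'}
```

Here 20 runs from 17 balls is a strike rate of 117.647..., which one site rounds to 117.65 and another truncates to 117.64, while `minutes` is skipped because only one site reports it.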
Number of bowlers
As with batters, the simplest check I do with regards to bowlers in an innings is to check that the various scorecards have the same number of bowlers listed for the innings.
Details for each bowler
As with batters, one of the more complicated checks takes place when I validate each bowler. Right now I check any bowling figures that appear on 2 or more scorecards for the bowler. Most figures must match exactly; however, I do allow a slight leeway (of 0.01 runs) for the economy rate, as different sites use different methods of rounding. I also deliberately ignore wides and no-balls listed for bowlers on the ICC site, as they are 1) not shown on the scorecard page despite being in the associated JSON, and 2) incomplete at best anyway.
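Ignoring the ICC wides and no-balls could be done as a filtering step before the figures are compared. This is only a sketch: the `IGNORED_FIELDS` table, the field names, and the function name are hypothetical.

```python
# Assumed field names; only the ICC figures are filtered.
IGNORED_FIELDS = {"ICC": {"wides", "no_balls"}}


def comparable_bowling_figures(figures_by_site):
    """Drop the fields that should not be compared for a given site."""
    return {
        site: {
            field: value
            for field, value in figures.items()
            if field not in IGNORED_FIELDS.get(site, set())
        }
        for site, figures in figures_by_site.items()
    }


bowling = {
    "ICC": {"overs": "4", "runs": 30, "wickets": 2, "wides": 0, "no_balls": 0},
    "ESPNcricinfo": {"overs": "4", "runs": 30, "wickets": 2, "wides": 1, "no_balls": 0},
}
print(comparable_bowling_figures(bowling)["ICC"])
# {'overs': '4', 'runs': 30, 'wickets': 2}
```

With only two sites, as in this example, the wides and no-balls then appear on a single scorecard and so would not be compared at all.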