Checks performed

A complete list of the checks I make when comparing scorecards.

Comparison statuses

If you’ve had a look around the site you may have seen mention of various “statuses”, but what do they actually mean? Well, let me tell you. There are 4 statuses that both a match and an individual comparison can have: Complete match, Approved match, Leeway match, and Mismatch. The status for a match is the worst status among the comparisons made for that match, so a match with 1 Approved match, 3 Complete match, and 2 Leeway match statuses would have a match status of Leeway match.
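The "worst status wins" rule can be sketched in a few lines of Python. This is only an illustration, not the site's actual code: the `Status` enum and `match_status` function are hypothetical names, and the ordering (Complete, Approved, Leeway, Mismatch, from best to worst) is inferred from the example above.

```python
from enum import IntEnum


class Status(IntEnum):
    """Comparison statuses, ordered from best (lowest) to worst (highest)."""
    COMPLETE_MATCH = 0
    APPROVED_MATCH = 1
    LEEWAY_MATCH = 2
    MISMATCH = 3


def match_status(comparison_statuses):
    """Return the worst status among the individual comparisons."""
    # Assumption: with no comparisons at all, fall back to the best status.
    return max(comparison_statuses, default=Status.COMPLETE_MATCH)


# The example from the text: 1 Approved, 3 Complete and 2 Leeway comparisons.
statuses = (
    [Status.APPROVED_MATCH]
    + [Status.COMPLETE_MATCH] * 3
    + [Status.LEEWAY_MATCH] * 2
)
print(match_status(statuses).name)  # LEEWAY_MATCH
```

Using an ordered enum keeps the rule to a single `max` call, and a single Mismatch anywhere in the list is enough to mark the whole match as a Mismatch.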

Complete match

If a comparison finds that a particular piece of information matches across all the scorecards it appears on, then the system regards that as a Complete match. If this status occurs for an individual comparison, I skip that comparison when generating the page for a match, as I’m only concerned with showing non-matching data.

Approved match

When a comparison has a status of Approved match this means that a discrepancy was found but that, after looking into it, I regard it as a reasonable difference. Examples include the number of batters in an innings where concussion substitutions (or X-player replacements in the Big Bash) take place, as there is no agreed method by which the substituted player is referred to, and occasions where a source links to the wrong player (where names are similar) but the details are otherwise correct.

Leeway match

When a comparison has a status of Leeway match this means that, while I couldn’t find a perfect match between the various scorecards for the data being checked, the worst discrepancies I found were down to acceptable differences in fields such as strike rate or economy rate. For example, different sites round strike rates differently, meaning that one site might report a strike rate as 123.13 and another site as 123.14. I allow a “leeway” of 0.01 for these fields, and treat any discrepancies that are within this leeway as Leeway match discrepancies.
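A minimal sketch of a leeway comparison, assuming the rates arrive as the strings printed on each site. The `compare_rate` function and `LEEWAY` constant are hypothetical names. `Decimal` is used here because subtracting binary floats such as 123.14 and 123.13 may give a result that is very slightly off 0.01, which could wrongly push a pair outside the tolerance.

```python
from decimal import Decimal

LEEWAY = Decimal("0.01")  # the tolerance described in the text


def compare_rate(values):
    """Classify the rates reported by different sites for one player."""
    numbers = [Decimal(v) for v in values]
    spread = max(numbers) - min(numbers)
    if spread == 0:
        return "Complete match"
    if spread <= LEEWAY:
        return "Leeway match"
    return "Mismatch"


print(compare_rate(["123.13", "123.14"]))  # Leeway match
print(compare_rate(["123.13", "123.13"]))  # Complete match
print(compare_rate(["123.13", "123.20"]))  # Mismatch
```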

Mismatch

A Mismatch is recorded when I find a non-acceptable discrepancy between the data from the scorecards for a match. An example might be scorecards showing a different number of runs scored by a batter, or different types of extras for an innings.

Match information checks

Result

This check ensures that the result listed on each scorecard actually matches. I do some normalisation of the result to make it easier to compare, for example by tweaking the names, and using the single form D/L method for the various versions of Duckworth-Lewis(-Stern).
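As a rough illustration of this kind of normalisation, the sketch below maps a few spellings of the method to the single form "D/L method". The list of spellings in the pattern is an assumption, as is the `normalise_result` name; the name tweaks mentioned above are not covered.

```python
import re

# Assumed spellings of the method; the real set of variants may differ.
_DL_PATTERN = re.compile(
    r"\b(?:Duckworth[-/ ]Lewis(?:[-/ ]Stern)?|DLS|D/L)\b(?:\s+method)?",
    re.IGNORECASE,
)


def normalise_result(result):
    """Collapse whitespace and rewrite every assumed variant as 'D/L method'."""
    text = " ".join(result.split())
    return _DL_PATTERN.sub("D/L method", text)


print(normalise_result("Team A won by 5 runs (DLS method)"))
print(normalise_result("Team A won by 5 runs  (Duckworth-Lewis-Stern method)"))
# Both lines print: Team A won by 5 runs (D/L method)
```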

Teams

This check ensures that the teams listed on each scorecard actually match.

Innings checks

As well as checking that the data already mentioned is correct for the overall scorecard, I also perform a number of checks on each innings within a scorecard.

Number of innings

Ensures that each scorecard for a match covers the same number of innings. In order to do this I deliberately skip any super overs that are listed within scorecards (as CricketArchive sometimes do), as other sites don’t currently include super overs in their scorecards. I’d prefer to see super overs included in scorecards by default, but I’m not yet the arbiter of all that is right with regards to scorecards.
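A sketch of the super over handling, assuming each scorecard is a list of innings dictionaries and that super over innings carry a `super_over` flag. Both the data shape and the flag are hypothetical.

```python
def regular_innings(innings_list):
    """Drop super over innings so only regular innings remain."""
    return [i for i in innings_list if not i.get("super_over", False)]


def same_number_of_innings(scorecards):
    """True when every scorecard has the same count of regular innings."""
    counts = {len(regular_innings(card)) for card in scorecards}
    return len(counts) <= 1


# One source lists the super overs, the other does not.
card_a = [
    {"team": "A"},
    {"team": "B"},
    {"team": "A", "super_over": True},
    {"team": "B", "super_over": True},
]
card_b = [{"team": "A"}, {"team": "B"}]
print(same_number_of_innings([card_a, card_b]))  # True
```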

Batting team

The batting team for this particular innings is checked to ensure that it is the same in all of the scorecards.

Extras

Checks that the extras listed for the innings match across all of the scorecards. This check requires that all of the different types of extras match, not just the totals.
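To show why comparing totals alone is not enough, here is a small sketch that compares extras type by type. The field names in `EXTRA_TYPES` are assumptions, and treating a missing type as 0 is a further assumption.

```python
# Hypothetical field names for the types of extras.
EXTRA_TYPES = ("byes", "leg_byes", "wides", "no_balls", "penalties")


def differing_extras(extras_by_site):
    """Return the extras types whose values differ between sites."""
    differing = []
    for extra_type in EXTRA_TYPES:
        # Assumption: a type that a site does not list counts as 0.
        values = {site.get(extra_type, 0) for site in extras_by_site}
        if len(values) > 1:
            differing.append(extra_type)
    return differing


site_a = {"byes": 1, "leg_byes": 4, "wides": 3}
site_b = {"byes": 4, "leg_byes": 1, "wides": 3}
# Both total 8 extras, but the breakdown differs.
print(differing_extras([site_a, site_b]))  # ['byes', 'leg_byes']
```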

Totals

The total runs scored, the number of wickets that fell, and whether the innings was declared (for multi-day matches) are compared across the scorecards to ensure that each field matches.

Fall of wicket checks

Number of fallen wickets

Checks that the number of wickets listed in the “fall of wickets” section of the innings matches across sites. I actually do some extra work for this check, because sites deal with batters who retire hurt in different ways. The ICC include such batters in the “fall of wickets” section, but with the same wicket number as the previous entry; ESPNcricinfo include the same batter but indicate the reason; CricketArchive don’t include the retired batter at all. I exclude the retired hurt batters from the ICC and ESPNcricinfo lists to make comparison easier.
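The two kinds of clean-up described above might look something like the following. The entry layout (`wicket`, `score`, `batter`, `reason`) is hypothetical, and the sketch does not handle a batter retiring before the first wicket has fallen.

```python
def drop_flagged_retirements(entries):
    """For lists that state a reason: drop entries marked as retired hurt."""
    return [e for e in entries if e.get("reason") != "retired hurt"]


def drop_repeated_wicket_numbers(entries):
    """For lists where a retirement repeats the previous wicket number."""
    kept = []
    previous_wicket = None
    for entry in entries:
        if entry["wicket"] != previous_wicket:
            kept.append(entry)
        previous_wicket = entry["wicket"]
    return kept


repeated_style = [
    {"wicket": 1, "score": 12, "batter": "A"},
    {"wicket": 1, "score": 40, "batter": "B"},  # retired hurt, repeats wicket 1
    {"wicket": 2, "score": 75, "batter": "C"},
]
flagged_style = [
    {"wicket": 1, "score": 12, "batter": "A"},
    {"score": 40, "batter": "B", "reason": "retired hurt"},
    {"wicket": 2, "score": 75, "batter": "C"},
]
print(len(drop_repeated_wicket_numbers(repeated_style)))  # 2
print(len(drop_flagged_retirements(flagged_style)))       # 2
```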

Details for each fallen wicket

As well as checking for the number of fallen wickets in an innings I also check the details of each individual fallen wicket entry. This involves checking the wicket number, and score at the time of the wicket, as well as the delivery on which the wicket occurred (if available). For CricketArchive, generally the delivery for the fall of a wicket is recorded in the form 11.5 (or similar), however if the final wicket of an innings falls on the last ball of the final over the delivery is recorded differently with the over number alone being used. I adjust for this in order to make comparisons possible.
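A sketch of this adjustment, under a loudly stated assumption: that a bare over number such as "20" stands for the last ball of the 20th over, which the usual over.ball notation would write as 19.6. The `normalise_delivery` name and the six-ball over default are also assumptions.

```python
def normalise_delivery(delivery, balls_per_over=6):
    """Return the delivery as 'over.ball', expanding a bare over number."""
    text = delivery.strip()
    if "." in text:
        return text  # already in over.ball form, e.g. "11.5"
    # Assumption: "20" means the final ball of the 20th over, i.e. "19.6".
    overs = int(text)
    return f"{overs - 1}.{balls_per_over}"


print(normalise_delivery("11.5"))  # 11.5
print(normalise_delivery("20"))    # 19.6
```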

Batter checks

Number of batters

The simplest check I do with regards to batters in an innings is to check that the various scorecards have the same number of batters listed for the innings. This doesn’t include any players who didn’t bat simply because they didn’t get in or were absent hurt.

Number of players who did not bat

This check compares the scorecards to ensure that the number of players listed as not having batted in the innings is the same. Absent hurt batters are not included in this. Surprisingly this can be a source of discrepancies between sources, as there isn’t yet an agreed standard for including or excluding players who come into a team as concussion substitutes, with some sites including them and some not. For that reason, if there is a discrepancy due to that difference in approach (but only for that reason), I don’t regard it as an actual error.

Number of players who were absent hurt

The last batter number-based check is one for the number of absent hurt players in an innings.

Details for each batter

One of the more complicated checks takes place when I validate each batter. Right now I check any batting figures that appear on 2 or more scorecards for the batter, as well as the dismissal method (if the batter was out). Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the strike rate, as different sites use different methods of rounding.
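Put together, the per-batter comparison could be sketched as below: only fields reported by at least two sites are compared, and only the strike rate gets a leeway. The field names, the string-valued figures, and the `compare_batter` function are all hypothetical.

```python
from decimal import Decimal

# Fields that tolerate a small difference, with their leeway.
LEEWAY_FIELDS = {"strike_rate": Decimal("0.01")}


def compare_batter(figures_by_site):
    """Return {field: status} for every field reported by two or more sites."""
    fields = set()
    for figures in figures_by_site.values():
        fields.update(figures)

    results = {}
    for field in sorted(fields):
        values = [f[field] for f in figures_by_site.values() if field in f]
        if len(values) < 2:
            continue  # only one site reports this field, nothing to compare
        if field in LEEWAY_FIELDS:
            values = [Decimal(v) for v in values]
        if len(set(values)) == 1:
            results[field] = "Complete match"
        elif field in LEEWAY_FIELDS and max(values) - min(values) <= LEEWAY_FIELDS[field]:
            results[field] = "Leeway match"
        else:
            results[field] = "Mismatch"
    return results


sites = {
    "site_a": {"runs": "37", "balls": "30", "strike_rate": "123.33", "minutes": "52"},
    "site_b": {"runs": "37", "balls": "30", "strike_rate": "123.34"},
    "site_c": {"runs": "38", "balls": "30"},
}
print(compare_batter(sites))
# {'balls': 'Complete match', 'runs': 'Mismatch', 'strike_rate': 'Leeway match'}
```

Here `minutes` is skipped because only one site reports it, while the differing `runs` value is a Mismatch because no leeway applies to that field.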

The actual person who batted is also checked to ensure that the scorecards don’t have different players even if the other details match. Doing this alone caused me to start another project to allow people to be easily mapped across sites, and some day soon I’ll release the details of that.

Bowler checks

Number of bowlers

As with batters, the simplest check I do with regards to bowlers in an innings is to check that the various scorecards have the same number of bowlers listed for the innings.

Details for each bowler

As with batters, one of the more complicated checks takes place when I validate each bowler. Right now I check any bowling figures that appear on 2 or more scorecards for the bowler. Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the economy rate, as different sites use different methods of rounding.

The actual person who bowled is also checked to ensure that the scorecards don’t have different players even if the other details match. This follows the same procedure used for the batters.