Checks

A complete list of the checks I make when comparing scorecards.

Comparison statuses

If you’ve had a look around the site you may have seen mention of various “statuses”, but what do they actually mean? Well, let me tell you. There are 3 statuses that both a match and an individual comparison can have: Complete match, Leeway match, and Mismatch. The status for a match is the worst status found among the comparisons for that match, so a match with 3 Complete match and 2 Leeway match statuses would have a match status of Leeway match.
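As a rough illustration of the "worst status wins" rule (a sketch only, not the site's actual code), the statuses can be ordered from best to worst and the match status taken as the maximum. The `Status` enum, the `match_status` function, and the choice to treat a match with no comparisons as a Complete match are all assumptions made for this example.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered from best to worst so that max() picks the worst status.
    COMPLETE_MATCH = 0
    LEEWAY_MATCH = 1
    MISMATCH = 2


def match_status(comparison_statuses):
    """Return the worst status among the individual comparisons.

    An empty list is treated as a Complete match (an assumption).
    """
    return max(comparison_statuses, default=Status.COMPLETE_MATCH)


# The example from the text: 3 Complete match and 2 Leeway match statuses.
statuses = [Status.COMPLETE_MATCH] * 3 + [Status.LEEWAY_MATCH] * 2
print(match_status(statuses).name)  # LEEWAY_MATCH
```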

Complete match

If a comparison finds that a particular piece of information matches across all of the scorecards it appears on, then the system regards that as a Complete match. When an individual comparison has this status I skip it when generating the page for a match, as I’m only concerned with showing non-matching data.

Leeway match

When a comparison has a status of Leeway match this means that, while I couldn’t find a perfect match between the various scorecards for the data being checked, the worst discrepancies I found were down to acceptable differences in fields such as strike rate or economy rate. For example, different sites round strike rates differently, meaning that one site might report a strike rate as 123.13 and another site as 123.14. I allow a “leeway” of 0.01 for these fields, and treat any discrepancies that are within this leeway as being Leeway match discrepancies.
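A minimal sketch of how a leeway comparison like this could work, assuming the rates arrive as strings and that the leeway applies to the spread between the highest and lowest reported value. The names `LEEWAY` and `compare_rate` are hypothetical. `Decimal` is used here because subtracting binary floats such as 123.14 and 123.13 does not give exactly 0.01.

```python
from decimal import Decimal

# The 0.01 leeway described in the text.
LEEWAY = Decimal("0.01")


def compare_rate(values):
    """Classify a rate (given as strings) reported by several scorecards."""
    rates = [Decimal(value) for value in values]
    spread = max(rates) - min(rates)
    if spread == 0:
        return "Complete match"
    if spread <= LEEWAY:
        return "Leeway match"
    return "Mismatch"


print(compare_rate(["123.13", "123.14"]))  # Leeway match
```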

Mismatch

A Mismatch is recorded when I find a non-acceptable discrepancy between the data from the scorecards for a match. An example might be scorecards showing a different number of runs scored by a batter, or different types of extras for an innings.

Match information checks

Result

This check ensures that the result listed on each scorecard actually matches. I do some normalisation of the result to make it easier to compare, for example by tweaking the team names, and by using the single form “D/L method” for the various versions of Duckworth-Lewis(-Stern).
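The Duckworth-Lewis part of that normalisation might look something like the sketch below. The list of spelling variants in the pattern and the `normalise_result` name are assumptions for illustration. The name tweaking mentioned above is not covered here.

```python
import re

# Variants of the method name that are collapsed into one form.
# This list of variants is an assumption, not taken from any site.
_DL_VARIANTS = re.compile(
    r"\b(?:Duckworth[-/ ]Lewis(?:[-/ ]Stern)?|D/L/S|DLS|D/L)(?:\s+method)?",
    re.IGNORECASE,
)


def normalise_result(result):
    """Collapse whitespace and rewrite any D/L variant as 'D/L method'."""
    text = " ".join(result.split())
    return _DL_VARIANTS.sub("D/L method", text)


print(normalise_result("India won by 5 runs (DLS method)"))
# India won by 5 runs (D/L method)
```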

Teams

This check ensures that the teams listed on each scorecard actually match.

Innings checks

As well as checking that the data already mentioned is correct for the overall scorecard, I also perform a number of checks on each innings within a scorecard.

Number of innings

Ensures that each scorecard for a match covers the same number of innings. In order to do this I deliberately skip any super overs that are listed within scorecards (as CricketArchive sometimes do), as other sites don’t currently include super overs in their scorecards. I’d prefer to see super overs included in scorecards by default, but I’m not yet the arbiter of all that is right with regards to scorecards.
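One way to picture the super over handling is to filter those innings out before counting. In this sketch each scorecard is a list of innings dictionaries with a hypothetical `is_super_over` flag. That structure and the function names are assumptions made for the example.

```python
def regular_innings(innings):
    """Drop any innings flagged as a super over."""
    return [inn for inn in innings if not inn.get("is_super_over", False)]


def same_number_of_innings(scorecards):
    """True if every scorecard covers the same number of regular innings."""
    counts = {len(regular_innings(card)) for card in scorecards.values()}
    return len(counts) == 1


scorecards = {
    "CricketArchive": [
        {"team": "A"},
        {"team": "B"},
        {"team": "B", "is_super_over": True},
        {"team": "A", "is_super_over": True},
    ],
    "ESPNcricinfo": [{"team": "A"}, {"team": "B"}],
}
print(same_number_of_innings(scorecards))  # True
```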

Batting team

The batting team for this particular innings is checked to ensure that it is the same in all of the scorecards.

Extras

Checks that the extras listed for the innings match across all of the scorecards. This check requires that all of the different types of extras match, not just the totals.
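To show why comparing totals alone isn't enough, here is a small sketch that compares extras type by type. The field names in `EXTRA_TYPES`, the `mismatched_extras` function, and the choice to treat a missing type as 0 are all assumptions.

```python
# Hypothetical field names for the different types of extras.
EXTRA_TYPES = ("byes", "leg_byes", "wides", "no_balls", "penalties")


def mismatched_extras(extras_by_site):
    """Return the types of extras whose values differ between sites."""
    mismatched = []
    for kind in EXTRA_TYPES:
        # A type missing from a scorecard is treated as 0 (an assumption).
        values = {extras.get(kind, 0) for extras in extras_by_site.values()}
        if len(values) > 1:
            mismatched.append(kind)
    return mismatched


# Both scorecards total 10 extras, but the breakdown differs.
extras_by_site = {
    "site_a": {"byes": 1, "leg_byes": 3, "wides": 4, "no_balls": 2},
    "site_b": {"byes": 1, "leg_byes": 3, "wides": 2, "no_balls": 4},
}
print(mismatched_extras(extras_by_site))  # ['wides', 'no_balls']
```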

Totals

The total runs scored, the number of wickets that fell, and whether the innings was declared (for multi-day matches) are compared across the scorecards to ensure that each field matches.

Fall of wicket checks

Number of fallen wickets

Checks that the number of wickets listed in the “fall of wickets” section of the innings matches across sites. I actually do some extra work for this check, due to the different ways sites deal with batters who retire hurt. The ICC include such batters in the “fall of wickets” section, but give them the same wicket number as the previous entry. ESPNcricinfo also include the batter, but indicate the reason. CricketArchive don’t include the retired batter at all. I exclude the retired hurt batters from the ICC and ESPNcricinfo lists to make comparison easier.
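The retired hurt clean-up could be sketched roughly as follows. The entry layout (`wicket`, `batter`, `reason`) and both function names are hypothetical. The ICC rule here keeps an entry only when its wicket number is higher than the last one kept, which is one assumed way of detecting a repeated wicket number.

```python
def drop_retired_icc(entries):
    """Keep only entries whose wicket number increases (ICC-style lists)."""
    kept, last_wicket = [], 0
    for entry in entries:
        if entry["wicket"] > last_wicket:
            kept.append(entry)
            last_wicket = entry["wicket"]
    return kept


def drop_retired_cricinfo(entries):
    """Drop entries that carry a 'retired hurt' reason (ESPNcricinfo-style)."""
    return [e for e in entries if e.get("reason") != "retired hurt"]


icc = [
    {"wicket": 1, "batter": "A"},
    {"wicket": 1, "batter": "B"},  # retired hurt: repeats wicket number 1
    {"wicket": 2, "batter": "C"},
]
cricinfo = [
    {"wicket": 1, "batter": "A"},
    {"wicket": 1, "batter": "B", "reason": "retired hurt"},
    {"wicket": 2, "batter": "C"},
]
print(len(drop_retired_icc(icc)), len(drop_retired_cricinfo(cricinfo)))  # 2 2
```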

Details for each fallen wicket

As well as checking for the number of fallen wickets in an innings I also check the details of each individual fallen wicket entry. This involves checking the wicket number, and score at the time of the wicket, as well as the delivery on which the wicket occurred (if available).

Batter checks

Number of batters

The simplest check I do with regards to batters in an innings is to check that the various scorecards have the same number of batters listed for the innings. This doesn’t include any players who didn’t bat simply because they didn’t get in or were absent hurt.

Number of players who did not bat

This check compares the scorecards to ensure that the number of players listed as not having batted in the innings is the same. Absent hurt batters are not included in this. Perhaps surprisingly this can be a source of discrepancies between sources, as there doesn’t yet seem to be an agreed standard for including or excluding players who come into a team as concussion substitutes, with some sites including them and some not.

Number of players who were absent hurt

The last batter number-based check is one for the number of absent hurt players in an innings.

Details for each batter

One of the more complicated checks takes place when I validate each batter. Right now I check any batting figures that appear on 2 or more scorecards for the batter, as well as the dismissal method (if the batter was out). Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the strike rate, as different sites use different methods of rounding.
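The "2 or more scorecards" rule can be pictured as a grouping step that runs before any values are compared. This is only a sketch: the `comparable_figures` name, the per-site dictionaries, and the use of `None` for a figure a site doesn't report are assumptions.

```python
def comparable_figures(figures_by_site):
    """Group values by figure name, keeping figures found on 2+ scorecards."""
    grouped = {}
    for site, figures in figures_by_site.items():
        for name, value in figures.items():
            if value is not None:  # None means the site doesn't report it
                grouped.setdefault(name, {})[site] = value
    return {name: values for name, values in grouped.items() if len(values) >= 2}


figures_by_site = {
    "site_a": {"runs": 45, "balls": 30, "minutes": 52},
    "site_b": {"runs": 45, "balls": 30, "minutes": None},
    "site_c": {"runs": 45},
}
print(sorted(comparable_figures(figures_by_site)))  # ['balls', 'runs']
```

Here `minutes` is skipped because only one scorecard reports it, so there is nothing to compare it against.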

Bowler checks

Number of bowlers

As with batters, the simplest check I do with regards to bowlers in an innings is to check that the various scorecards have the same number of bowlers listed for the innings.

Details for each bowler

As with batters, one of the more complicated checks takes place when I validate each bowler. Right now I check any bowling figures that appear on 2 or more scorecards for the bowler. Most figures must match exactly; however, I do allow a slight leeway (of 0.01) for the economy rate, as different sites use different methods of rounding. I also deliberately ignore wides and no-balls listed for bowlers on the ICC site, as they are 1) not shown on the scorecard page despite being in the associated JSON, and 2) incomplete at best anyway.
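Ignoring the ICC wides and no-balls could be as simple as dropping those fields before the comparison runs. The field names, the `"ICC"` site key, and the function name below are hypothetical.

```python
# Bowling fields ignored for the ICC site (field names are assumptions).
ICC_IGNORED_FIELDS = {"wides", "no_balls"}


def bowling_figures_for_comparison(site, figures):
    """Return a copy of the figures, without ignored fields for the ICC."""
    if site == "ICC":
        return {k: v for k, v in figures.items() if k not in ICC_IGNORED_FIELDS}
    return dict(figures)


icc_figures = {"overs": "4.0", "runs": 28, "wickets": 2, "wides": 1, "no_balls": 0}
print(sorted(bowling_figures_for_comparison("ICC", icc_figures)))
# ['overs', 'runs', 'wickets']
```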