Category: Defining Statistics

WAR, wOBA, UZR, What Is It Good For? Absolutely... Something? (Defining Statistics 2)

7/10/2022

Ah, WAR. The metric that has taken the baseball world by storm, whether it be for deciding who in a given season is the best player and most deserving of the MVP award, or even deciding who cumulatively are the best players of all-time and deserve to be in the Hall of Fame in Cooperstown. Many people love to use WAR these days, be it the baseball math nerds or even the more traditional baseball fans that are starting to learn the new ways. To me, the problem is that neither of these parties truly understand WAR. The traditionalists may write off WAR altogether, but when they do use it are wise enough to not use it as an end-all be-all of player performance. The analytics gurus use WAR with reckless abandon, crafting as many different splits and scenarios as they can imagine all based around WAR. However, despite the strong analytical acumen of this party, most members take WAR as is, treating it like any ordinary recorded statistic. WAR is not batting average, an easily calculated, in-the-moment, recordable statistic; rather, WAR is a complex framework for measuring overall player value. Many people that are very likely capable of actually understanding WAR choose not to do so. Even pages online that are meant to help describe WAR fail to adequately provide all of the details in one convenient place. It is my opinion that if anyone is to rely so heavily on WAR for their opinions on which players are the best, they ought to know how it is actually calculated. Thus is the purpose for this blog post: what is WAR, how is it calculated, and what are some of the benefits and drawbacks of using it as the primary basis for determining a player's total contributions?

WAR: How It's NOT Calculated
The first step of WAR is knowing what it stands for! WAR is an acronym for Wins Above Replacement. The general idea is that it stands to measure how many more wins a player is worth compared to a replacement level player. From an initial intuitive standpoint, we may think that WAR is calculated in an entirely different way from how it actually is. For instance, suppose we have a starting catcher who, when in the lineup, results in his team having a winning percentage of .600. Now suppose that whenever the backup catcher plays instead, the team has a winning percentage of just .400. That's a .200 increase in winning percentage when the starter plays, which across a 162 game season is worth about 32 games. So, maybe we think the starting catcher is worth 32 wins above the replacement/backup catcher. An alternative way to think this through is that if the team were to go .600 with the starting catcher for all 162 games, they'd win about 97 games, and if they went .400 with the backup for all 162 games, they'd only win about 65 games, which again is a difference of about 32 games. Of course, this example is certainly extreme, but based solely on the name of WAR - Wins Above Replacement - this may be how we think WAR is calculated. My favorite baseball podcast, which I often listen to on my drive to and from work, is called Effectively Wild and is sponsored by FanGraphs. The podcast is free to listen to on the Apple Podcasts app (and surely other podcast apps, for you non-Apple users). On episode 1841, they discussed ESPN baseball reporter Jeff Passan's tweet about the increase in the Minnesota Twins' record when star outfielder Byron Buxton is playing. About an hour and 3 minutes into the episode, they do a segment discussing the tweet and the inaccuracies of such an approach to assessing player value. Based on Passan's tweet, Buxton increases the Twins' win pace from 75 games to 101 games, making his supposed valued about 26 games! In the episode segment, they sought to find the player whose team's winning percentage increased the most when they played, for a 3 year stretch (the amount of time referenced in Passan's tweet). The results are... not promising. By this thought process, Mike Squires, a first basemen for the Chicago White Sox in the '70s and '80s, is the most valuable. From 1982 to 1984, the White Sox had a .669 winning percentage with Squires in the lineup at any point during the game, and a .145 winning percentage without Squires. As the workbook shows, you're hard pressed to find any truly memorable player at the top of list. The closest is Curt Flood, who while a great player and an important figure in baseball history, certainly wouldn't be anyone's pick for the greatest ever. On the reverse side, some of the seemingly worst players ever - players whose teams had higher winning percentages with them not appearing in the game - included some notable Hall of Famers, such as Enos Slaughter and Johnny Mize. Clearly this approach isn't accurate, but why is that? For one, many of the 'best' players were defensive replacements, who appeared later in games when their team was already ahead, making it much more likely that they would win. Alternatively, many of the 'worst' players were pinch hitters, who also appeared later in games but generally when their team was losing, making it more likely that they would lose. To avoid these substitution issues, the same analysis was done again, but this time only looking at the winning percentage of the team when the player started the game. These results are on the 3rd tab of the workbook, and the results are slightly better, but still not significant. We see some more names we know at the top, such as Shoeless Joe Jackson, Nap Lajoie, and Barry Bonds, but also many more recent or active players, such as Javier Baez, Andrew Benintendi, Manny Machado, and Trea Turner. But there are also notable names amongst the players at the bottom, such as Tony Gwynn, Craig Biggio, Reggie Jackson, and Xander Bogaerts. I doubt anyone thinks that Andrew Benintendi is one of the greatest MLB players ever, or that Tony Gwynn is one of the worst. The reason these results are so off is simply because whether a team wins or losses a game has many more factors to it than just a singular player's presence. For instance, whether or not other starters are also sitting out, making the overall starting lineup worse and less likely to win. Or, whether the normal starters just happen to play worse when one guy happens to be out. Most importantly, who the other team is! If the backups always play against the bad teams, but the starters have to play against the good teams, the team will likely have a higher winning percentage with the backups. The list goes on. In conclusion, Wins Above Replacement sounds like it would be how much more a player's team wins when he is playing versus when he's replaced, but it isn't, and that approach isn't a good indicator of actual player skill. We could try to isolate all of these other factors and use this approach, but I haven't seen such a thing been done and it would likely result in far too small of sample sizes for each player in question.

WAR: The Basics
Major League Baseball has an online glossary where it defines most of the statistics used in the game. Included in this glossary is WAR, which MLB makes sound like a fairly straightforward calculation. Again, WAR stands for Wins Above Replacement and seeks to measure the number of wins a player is worth above a replacement level player. Unlike nearly all other statistics, WAR is not a simple plug-and-chug formula based on recorded baseball events, such as getting a hit or striking out. Rather, WAR is more of an ever-evolving framework for determining player value. The general framework for WAR is based on determining the # of runs a player is worth in different areas of the game, and then translating those runs into wins (something I disagree with, which I'll touch more on later). See the general framework equation for position players below:

Pitchers are similar, but we may measure pitching WAR in a different way and then combine that with the pitcher's position-player WAR to get their total WAR. This looks easy, but in reality none of these are simple, recorded events. Not even Batting Runs or Baserunning Runs are variants of RBI or runs scored, but entirely different calculated measures altogether. To define the 'statistic' of WAR is really to define a long stream of statistics that are encompassed in WAR, which is what I'll be doing today. Again, WAR is really more of a framework than a statistic; the people who calculate WAR may change how they calculate these different Run components whenever they want, and do. Generally these changes are due to having more data to include, but the lack of this new data for older players makes the use of WAR when comparing players of different eras and for Hall of Fame consideration particularly troublesome. Because of this, WAR is a much better metric to use to compare current players than it is to compare all players in history. We just don't have the data for Babe Ruth that we have for Mike Trout, and using one version of WAR for Ruth and a different version of WAR for Trout and then comparing their WAR values isn't ideal and really shouldn't be done.

Because of the differences over time in calculating WAR, as well as the inherent flexibility/subjectivity in its calculation, there are actually 3 distinct versions of WAR that are determined by 3 different baseball entities. If you see fWAR, that does NOT mean fielding WAR, but rather WAR as calculated by FanGraphs. Likewise, bWAR or rWAR don't mean batting or running WAR, but rather WAR as calculated by Baseball Reference. Lastly, WARP is calculated by Baseball Prospectus, whose version stands for Wins Above Replacement Player. Baseball Prospectus usually makes you pay to see the methods behind their madness, and people generally care more about WAR than WARP, so I'll only focus on the first two in this post. So, not only will I have to dig into each of the specific calculations for the 'Runs' metrics listed in the above equation, but I'll have to describe how FanGraphs and Baseball Reference both calculate each piece! What fun! Let's take a look.

Batting Runs
Batting Runs is meant to measure the offensive value (in terms of runs) of a player whilst batting, i.e. at the plate. The first step to calculating Batting Runs, which Baseball Reference refers to as Rbat, is calculating Weighted Runs Above Average (wRAA). Both websites use the same basic framework. You can read up on Baseball Reference's version here, and on FanGraph's here. Let's take a look at the equation:

This stat is kind of like if you were given a player's batting average and the number of at bats he had and wanted to determine how many hits he produced, or similarly if you were given a player's on-base percentage and the number of plate appearances he had and wanted to determine how many times he got on base. Essentially, we use a rate stat and combine it with the # of opportunities we have to determine the # of successes we end up with. The first difference here is that the final result isn't a recordable thing where we actually know how many a player achieved (as with hits), but rather a total that we must calculate and use on its own (weighted runs isn't a metric/event recorded during a game). The second difference is that we don't just care about the total final result, but rather how many more a player gets above average (i.e. instead of # of hits, we want # of hits above average, or in this case # of weighted runs above average). The third difference is that we don't use batting average to determine the # of weighted runs, but rather something called Weighted On Base Average (wOBA). So, what is wOBA? Essentially, wOBA is a rate stat that seeks to explain offensive value better than the traditional triple-slash-line metrics (batting average, on-base percentage, and slugging percentage) do, as well as better than OPS (on-base plus slugging) does. Let's enter a tangent on these, as really understanding wOBA is important on its own.

If you need a refresher on these different rate stats, feel free to read an earlier article of mine that discusses them, or look them up in Google or in the MLB glossary linked above.

Batting average is the most traditional offensive rate stat, and tells us the % of times a player got a hit, when he had the chance to (we use at-bats as the denominator, so we exclude sac bunts, sac flies, walks, catcher interferences, and hit by pitches). As we can see below, batting average certainly explains some of the ability of a team to score runs (generally, a higher team batting average means a team scores more runs per game), but it could be better.

The downside of batting average is that it does NOT consider walks or hit by pitches, and also considers all hits to be of equal value. Walks mean the runner is on base and has a chance to score, so clearly there is value there. Also, clearly home runs, which for sure score the batter and any other runners on base, are more valuable than singles that only give the batter a chance to score.

The next step up is on-base percentage (OBP), which does slightly better in that it does consider walks and hit by pitches. The book and movie Moneyball became famous in describing the 2002 Oakland Athletics' strategy to prioritize players with higher on-base percentages rather than batting averages, leading to great regular season success. Let's see how a team's on-base percentage does at describing its runs per game:

Better than batting average for sure, but there's still room for improvement. For one, on-base percentage still considers all hits to be of equal value, which is of course wrong.

Then we have slugging percentage, which finally weighs the different types of hits differently. Singles are worth 1, doubles are worth 2, triples are worth 3, and home runs are worth 4. How does a team's slugging percentage do at explaining its runs scored per game?

While this is still better than batting average, it's actually a bit worse than on-base percentage. This is because slugging percentage ignores the improvement from on-base percentage in considering walks and hit by pitches, and also because the hitting weights that slugging percentage uses aren't actually all that correct. While a single certainly isn't worth as much as a home run, the long ball isn't quite 4 times more valuable than a typical base-knock.

Then steps in OPS (literally, On base Plus Slugging), which combines on-base percentage and slugging percentage to form a rate stat that actually weighs the types of hits differently and still considers walks and hit by pitches as valuable. As we can see, a team's OPS is the best describer of its runs scored per game yet:

OPS is a pretty solid indicator, but almost out of coincidence. There's no real work out there supporting why the OPS weights should be what they are, but rather it's just a quick and easy-to-calculate stat that does pretty well. Since it adds OBP to slugging, the weights are essentially 1 for a walk or HBP, 2 for a single, 3 for a double, 4 for a triple, and 5 for a home run. However, OPS is mathematically sinful in that it adds two pieces with different denominators; the denominator of OBP is plate appearances, but the denominator for slugging percentage is at-bats. Furthermore, since slugging percentage is nearly always a higher figure than OBP is (a good OBP is .400, a good slugging is .600), OPS is slightly skewed towards favoring slugging percentage. Thus, players that lead the league in slugging are more likely to lead the league in OPS than players that lead the league in OBP. This shouldn't be the case, since we showed previously that OBP is actually better than slugging. Truthfully, a quicker and more accurate approach would be to use something along the lines of slugging plus 1.5 to 2 times OBP; that makes the inherent weights closer to what wOBA supports. OPS is a better indicator than the other 3, but we can still do better, and try to actually explain our weights by using mathematical and baseball logic. Enter wOBA.

The first step to understanding wOBA is understanding RE24, which is the Run Expectancy based on the 24 base-out states in baseball. There are 3 out situations where play continues in baseball: 0 outs, 1 out, or 2 outs. Likewise, there are 8 distinct combinations of runners on the bases, from nobody on to the bases loaded. Combining these, we have a total of 24 distinct base-out states that exist in baseball, such as a man on first with 1 out or a man on third with 2 outs. Not surprising, we can expect that the number of runs a team will score on average depends on the base-out state that the team is in. It's more likely that you'll score more runs with 0 outs and the bases loaded than with 2 outs and nobody on. Below is an example run expectancy matrix from FanGraphs, showing how many runs we expect a team to score based on the base-out state:

Note that this isn't the golden, catch-all run expectancy matrix for all of baseball history, and nobody appears to use the matrices in that way. Rather, the matrices vary based on the data being used to develop them, and it's common to develop different matrices for different years or periods of time. For example, Tom Tango has a different run expectancy matrix located here, using data from 1999 to 2001. Tango has an even more comprehensive run expectancy matrix located here, which has 4 different matrices for 2010-2015, 1993-2009, 1969-1992, and 1950-1969. The post also shows the frequency of each base-out state across these periods, as well as the probability that a run will score for each of the base-out states.

The next step is using these run expectancies to calculate the weights for our different events in wOBA. FanGraphs makes this sound more complicated than it really is and refers to it as Linear Weights. Tom Tango also explains linear weights here. In reality, you simply start with the current base-out state and see which state you ended up in, as a result of the offensive event. The weight is simply the change in the resulting run expectancy, plus any runs that actually scored. For example, using the above run expectancy matrix, if the bases are empty with 0 outs, my team expects to score .461 runs. If I then hit a single and change the state to a man on first with 0 outs, the expectation goes up to .831 runs, meaning my single increased my team's expectancy by .831 - .461 = .37 runs, so the weight for my single would be .37 runs. If instead I hit a home run, the ending state would be the exact same, but I actually scored a run, so the weight for my homer would be 1 run. If I hit a double with 2 outs and men on first and third, and both runners score, my starting expectancy was .471 runs, and my end state expectancy (man on 2nd, 2 outs) is .305 runs + the 2 runs that actually scored or 2.305 runs. Then the value of my double would be 2.305 - .471, or 1.834 runs. The process is the same for any event type and any base-out state. Find the difference in run expectancies for the start and end states and add any runs that scored.

As you may imagine, the resulting weights for the same event type can vary heavily; a single with 0 outs and the bases empty won't increase the expectancy as much as a single with 2 outs and the bases loaded with 2 runs scoring. The core philosophy behind the weighted runs approach is that the traditional stats of RBI and run scored are too contingent on the skill of a player's teammates, and thus fail to accurately reflect the player's own skill. You could bat 1.000 and hit only triples for an entire season and *technically* never record any RBI if nobody was ever on base when you were up to bat, and also never score a run if nobody else ever drove you in or you never stole home or advanced on a wild pitch, passed ball, or a balk. Since you driving people in is dependent on there being people on base, and you scoring is mainly dependent on other competent batters driving you in, RBI and runs scored are slightly flawed metrics in measuring an individual player's general run producing value. So, instead we seek to figure out the average run value of the different offensive events and assess value that way. Because of this, we don't add up a player's total increase in their team's run expectancy for all their offensive hits. Rather, we look at the total increase for a given event type for all players, and then divide by the number of times that event occurred to get the average increase in run expectancy for that event type (be it singles, doubles, etc.). Put another way, I don't determine that one of Joey Votto's singles was worth x runs and that another was worth y runs and then add those all up, but rather determine that any single on average is worth z runs and multiply by the # of singles that Votto recorded. If we didn't do this, we'd be repeating the flaws of RBI and runs scored since the value of our events would be dependent on how many runners are on base. This average increase in run expectancy is *almost* the weight for the event type in the wOBA formula. This post by Tom Tango, the creator of wOBA, shows the weights for each event type by base-out state, as well as their overall average values, using data from 1999 to 2002.

The final step to get the wOBA formula weights is a little bit of shifting and scaling. First, since the other rate stats like batting average, OBP, slugging percentage, and OPS all treat outs as having 0 weight, all of the weights are shifted up by the value of an out. As the Tango post above shows, an out is actually worth about -.3 runs; thus, all other event types are shifted up by .3, meaning our HR value of 1.409 becomes 1.709. Second, to put wOBA on a scale that is familiar with baseball fans (i.e. easier to determine what a 'good' wOBA is), all of the wOBA weights are multiplied by what is called the wOBA Scale so that the total scale of wOBA is the same as the league's average on-base percentage. The wOBA Scale is simply the unscaled (but shifted) wOBA divided by the league average OBP. So, if say my wOBA Scale was 1.15, I would multiply that by 1.709 to get my final wOBA weight for HRs as 1.965.

Once we have all of the now scaled wOBA weights, we can use them in our actual wOBA equation, which works similar to the other offensive rate stats. The below equation is from The Book, Tom Tango's book where he discusses wOBA and other statistical baseball topics:

The NIBB stands for non-intentional bases on balls (all walks besides intentional walks), and the RBOE stands for reached based on error, a figure that most wOBA equations today don't include. In The Book, Tango used a different dataset, so his initial unscaled weights were slightly different, thus the difference in the HR weight here. Nonetheless, the process is the same: determine the unscaled weights by using the run expectancy matrix and linear weights, shift the weights up based on the run value of an out, and then multiply by the wOBA Scale to get wOBA on the same scale as on-base percentage.

How does a team's wOBA do in describing its runs scored per game? Let's take a look:

While it may be difficult to visually see, wOBA is the best describer of run scoring yet, even better than OPS. And better yet, wOBA actually has some thought and data behind why it weights each event type a certain way. However, to me, wOBA is not without its flaws either. Unlike our other rate stats, whose weights (albeit inaccurate) are definitively locked in until the end of time (i.e. a single is always worth 1 in slugging percentage), the wOBA weights are actually recalculated and applied every season. FanGraphs has a list of the wOBA weights and the wOBA Scale for each season, here. It's my opinion that a single does have a true intrinsic run value throughout the course of baseball history, and that players who hit more singles in a given year shouldn't be docked because supposedly singles were less valuable that season. I believe that the value of a single doesn't come from its relative frequency/demand (it's not a stock or commodity), but rather solely in how close it puts the batter to scoring and how many runs it drives in on average. To this end, I have my entire own rate statistic and measurement of player value that I look forward to introducing soon.

Now that we know all about wOBA, returning to the wRAA equation is fairly straightforward. First, we calculate the league average wOBA for that season, and subtract it from wOBA. Since we're calculating weighted runs above average, we only care about the 'above average' part of wOBA. Second, you'll notice that wRAA actually divides by the wOBA Scale, returning us back to our more true weights. wRAA is a standalone statistic and doesn't care about being on the same scale as the league average on-base percentage, like wOBA does. Doing so brings us to what Tango referred to as the 'Run value per PA above average'. By multiplying by a player's actual plate appearances, we get his wRAA as shown in the equation above.

Now that we have wRAA, the two sites apply some different adjustments to get their versions of Batting Runs. FanGraphs adjusts by league and park. I somewhat agree with park adjustments, but am more against league adjustments. For the park adjustment, FanGraphs uses what are called Park Factors. The idea is that some parks are higher or lower run scoring environments, so each player's wRAA should be scaled by his home team's Park Factor to adjust his offensive skill. My earlier 'Defining Statistics' article also discussed park factors. Baseball Savant has a list of Park Factors by team and event here, and FanGraphs describes them in more detail here. Baseball Reference also adjusts for park, and details that here. I won't go into the specifics, but essentially teams who have ballparks that experience more runs scored than average will have Park Factors greater than 1, and teams whose parks experience less runs being scored will have Park Factors less than 1. Thus, players that play for a team with a higher Park Factor will have their wRAA decreased (so we don't favor Rockies or Reds players too much), and players that play for a team with a lower Park Factor will have their wRAA increased (so we don't penalize Mariners or Athletics players too much). The adjustment is done by taking the MLB league average runs per plate appearance, and subtracting from that the park-adjusted league average runs per plate appearance, weighted by the player's number of plate appearances. A lot there, so look at the equation later on.

A similar thing is done for the league adjustment, but instead of using the MLB league average runs per plate appearance, we use the MLB league average Weighted Runs Created (wRC) per plate appearance. Well, what is wRC? Take a look:

wRC is another Tango creation and is described by FanGraphs (along with the more popular wRC+) here. You may notice that the first part of this equation is similar to wRAA; wRAA is just wRC with the league runs per plate appearance set to 0. So, wRC is basically just wRAA but scaled for the league's run scoring environment that season. Ok, so let's move on back to adjusting wRAA to get Batting Runs.

With the league adjustment, we take the overall MLB league runs per plate appearance and subtract from it the specific AL or NL league wRC per plate appearance, and then multiply by a player's number of plate appearances. Take a look at the equation below to get a better feel.

Finally, to get Batting Runs for FanGraphs, we take the baseline wRAA and add the park and league adjustments as discussed to get the following equation:

Realistically, I fail to directly see the rationale behind using the AL or NL wRC per plate appearance as an adjustment. It would make more since to me to simply calculate another 'League Factor' as the AL R/PA divided by the NL R/PA (or vice versa) and then multiply that factor by the MLB R/PA as the value to subtract and adjust by, but I digress. I also disagree with the prospect of having to adjust for specific league as well. As this shows, in the 117 World Series played in history, the AL has won 66 times and the NL has won 51 times, meaning the AL wins the World Series about 56.4% of the time. That's not a big enough change from .500 for me, especially since much of the AL victories are attributable to one specific team, the New York Yankees. Likewise, as this shows, in the 91 All-Star games that have been played in history, the AL has won 46 and the NL has won 43, with 2 ties. While I still have more research to do on my end before I can fully support or be against park or league adjustments, for now I feel that the league adjustment is unnecessary, and while I acknowledge that some parks are easier to score in than others, I fear adjusting real, recordable events like home runs by some factor into hypothetical amounts. The more recent innovations where we can actually determine if a given ball would be a homer in different parks based on its launch angle, distance, exit velocity, etc. is a much better approach to adjusting for park, in my mind. Since we don't have this data for our older players, I may be in favor of not using park factors at all when comparing players across time.

We've finished Batting Runs for FanGraphs, but still have to tidy up Baseball Reference's adjustments. As I said previously, Baseball Reference also adjusts for park and league. It adjusts by park as below, where the Ball Park Factor (BPF) is on a scale of 100 being average, unlike with FanGraphs and Baseball Savant where the factors are on a scale of 1 being average:
wRAA_pf = wRAA - (BPF/100 - 1) * PA * lgR/PA / (BPF/100)
The rest of the adjustments aren't shown formulaically, but rather just mentioned. Baseball Reference cites the differences in runs per game by the AL or NL in certain years as a reason for the need for a league adjustment (in 1933, AL averaged 5 runs per game, NL averaged 4). Baseball Reference adjusts wOBA to rOBA, which doesn't include pitcher batting stats in its calculation. rOBA values infield and outfield hits of the same type (mainly singles) differently, and likewise values batted-ball outs (such as a flyouts) differently from strikeouts. rOBA also accounts for the values of grounding into double plays, accounts for seasons where caught stealing data is unknown, and also includes reaches on errors as they believe it is a "repeatable skill".

Whew! That's it for Batting Runs, our first part of WAR! I encourage you to all take a look at the links to get a deeper understanding of anything that I couldn't make clear. Moving on.

Baserunning Runs
Baserunning Runs are meant to account for a player's offensive value (in terms of runs) whilst on the base paths. FanGraphs divides this into 3 separate pieces, as outlined below:
Baserunning Runs = UBR + wSB + wGDP
UBR stands for Ultimate Base Running and is meant to measure a player's skill on the bases, NOT counting stolen bases. This means things like advancing from 1st to 3rd on a single, and so on.
wSB stands for Weighted Stolen Base Runs and measures a player's skill at stealing bases, as well as being caught stealing.
wGDP stands for Weighted Grounded Into Double Play Runs and measures a player's skill at avoiding getting out on ground ball double plays.

wSB is the most straightforward and relies on the run-value weights of a stolen base and a caught stealing, determined in the same way as the other offensive events under wOBA. For instance, using the run expectancy matrix above, if I'm on first with 0 outs and steal 2nd, my team's run expectancy goes from .831 runs to 1.068 runs (man on 2nd with 0 outs), which is an increase of 1.068 - .831 = .237 runs. On the flip side, if I were to be caught stealing my team's run expectancy would drop to .243 runs (nobody on, 1 out), which is a decrease in .831 - .243 = .588 runs. So in this specific scenario, a SB is worth .237 and a CS is worth -.588. However, we must find the average value of the SB and CS by considering all possible stealing scenarios and weighing them based on their frequencies. Tom Tango has a stolen base being worth about .175 runs and a caught stealing being worth about -.467 runs in The Book. In his blog post that I linked to earlier, he has SBs at .195 and CSs at -.456. The FanGraphs weights for each season have a SB at .2 runs and a CS generally around -.4 runs. Again, these weights change every year, but if we use the ones I just mentioned from Tango's book we would get the following formula for wSB:

We get run value credit for each base we steal, and we get run value docked for each time we get out trying to steal. We see there's some consideration of the ways that we can get on first base, but what is lgwSB? That's the League Stolen Base Runs, and has the following formula:

We essentially take the league average proportion of times someone on first successfully stole 2nd, but weight based on the run value of being successful and unsuccessful.
Going back to the original wSB equation, we see that it is basically the run value above league average that a player was successful in stealing bases. Kudos for stealing a base, shame for getting out, and we only care what you did above a league average base stealer.

Baseball Reference calculates this piece very similarly, also relying on the wOBA/rOBA/wRAA values for a SB and CS, with the same wRAA adjustments as mentioned previously. The previously linked Baseball Reference wRAA has a list of the run values for stolen bases and caught stealings for each year at the bottom of the page, along with the run values for all the other events. They also have a SB as worth about .2 runs and a CS as worth about -.4 runs. Since they treat it like any other offensive event for wRAA, it already has that above average aspect to it. Baseball Reference refers to its Baserunning Runs as Rbr.

Now let's move on to the non-stolen base aspects of baserunning, but not the ground ball double play part yet. FanGraphs calls this piece UBR, and is given this information by Mitchell Lichtman (who also helped write The Book). You can't really calculate it yourself (well, you could if the necessary data were made available like it is for the batting events), which of course is a criticism of mine for this part of WAR. FanGraphs has a page where it describes UBR, as well as a primer written by Lichtman to describe it even further. Basically UBR is calculated much similarly to the other offensive events, as we see the increase in run expectancy a player gives his team by advancing bases in some way. Using the run expectancy matrix from up above, if I'm on first base with nobody out, my teams run expectancy is .831 runs. If a single is hit and I take the initiative to advance to third, then now my team's run expectancy is 1.798 runs (first and third with 0 outs), an increase of 1.798 - .831 = .967 runs. Now, a runner won't advance to 3rd every time this situation occurs; instead, he could only advance to 2nd, advance all the way home, or get out. We can look at how frequently these different outcomes occur, and use those as weights to multiply by each scenario's respective increase in run expectancy. That product gives us the average run value for the situation, meaning what we would expect an average base runner to do. Then, a baserunner only gets credit for the times that he particularly excels or suffers. If the average runner only advances to 2nd on a single from 1st, then a baserunner won't be rewarded for doing so. However, if that baserunner were to score or advance to 3rd, he would be rewarded relative to that increase, and if he were to get out and fail to even advance to 2nd, he would be docked. So in the previous example, if I expect the average baserunner to merely advance to 2nd (making the base-out state men on 1st and 2nd with 0 outs), the run expectancy is 1.373 runs. That means if I managed to advance further to 3rd, I increased my teams run expectancy above what an average baserunner would do by 1.798 - 1.373 = .425 runs. All of these increases and decreases across my season get tallied up to get my final UBR value.

The links above outline all of the different scenarios that are included in UBR, but essentially it's any time a baserunner could advance and how he does relative to what an average baserunner would do in that same situation. While FanGraphs does have values for UBR for each player each season (you can view Votto's UBR values here by scrolling down to the 'Advanced' table), it doesn't provide the actual data for calculating UBR. To do so, we would need for every advancement situation the run expectancy increase of each outcome, and the frequency of which those outcomes occurred. This would give us what we need to calculate how the league average baserunner would perform. Then, we would also need all the base advancing situations for a given player, how he advanced, the run expectancy increase of that advancement, and how that increase compares to what we'd expect the league average baserunner to do. Conceptually, UBR has as much merit as wOBA and wSB, but it suffers from the lack of available data to the public, as well as the increase in the number of hypotheticals and situations.

Baseball Reference calculates this piece very similarly, and includes it within the Rbr value. However, instead of relying on the change in run expectancy for the different advancements, it just finds the total # of times above or below average that a player advances, as well as how many more or less outs a player recorded on the base paths than average. Then, it multiplies each extra base taken by the run-value of an additional base (about .2 runs per base, roughly same as a SB), and likewise multiplies each extra out by the run-value of an out (about -.48 runs per out, roughly same as a CS). Baseball Reference is essentially an online baseball database, while FanGraphs focuses more on the writing side, so it makes sense for Baseball Reference to have more data. Each player has a base running page that shows the # of times they advanced in many of these situations. You can check out Joey Votto's base running page here by scrolling down until you get to the 'Baserunning & Misc. Stats' table. However, the comparative data for what we'd think an average baserunner would do is not available.

The final piece of Baserunning Runs is the grounding into double plays section. FanGraphs calls that wGDP and explains it here. Baseball Reference calls this piece Rdp and actually treats it as a distinct piece from Rbr. Essentially wGDP looks at how many double play opportunities a player had, and then determines how many times an average player would have hit into a double play. If the player hit into fewer double plays than average, he is rewarded, and vice versa. FanGraphs defines 'double play opportunities' as any time a batter is up with a man on 1st and less than 2 outs, but Baseball Reference defines this as any time a batter is up with a man on 1st, less than 2 outs, at least 1 out is recorded on the play, the batted ball was a ground ball, and the play was not recorded as a hit. Note that this only includes ground ball double plays, not line drive double plays. The idea behind wGDP is penalizing the batter for getting the other guy that was on base out. The batter's getting out is already reflected in wOBA. Since FanGraphs relies more on pure wOBA, which doesn't distinguish normal outs from ground ball double plays, it makes sense for the wider net of 'double play opportunities'. FanGraphs in wGB is penalizing the batter for getting the other runner out, since the batter's getting out is already reflected in wOBA. Alternatively, Baseball Reference's rOBA does take into consideration the worseness of a ground ball double play compared to a normal, non-strikeout out. Because of this, Baseball Reference's Rdp is less about penalizing the batter for getting the runner out, and more about the ability of the batter to beat out the throw to avoid making the play a double play (essentially, how good the player is at turning ground ball double plays into ground ball fielder's choices). This is more inherently a baserunning skill, so I think I prefer how Baseball Reference deals with this. Baseball Reference measures the difference between avoiding an otherwise double play and an actual double play as about .44 runs, roughly the same as avoiding a caught stealing. We get the following equation for Rdp:
R_dp = .44 × ( GIDP_OPPS_player * GIDP_RATE_lg - GIDP_player)
Here, GIDP_player is simply the number of actual ground ball double plays the player recorded.
GIDP_OPPS_player is the number of ground ball double play opportunities the player had, and GIDP_RATE_lg is the league average % of times a player grounds into a double play when given the opportunity to do so, so this product is essentially the number of times we'd expect an average player to ground into a double play. We find the difference from this average, and then multiply by the actual run value. If you beat out more throws and ground into fewer double plays than average, you're adding value, and if you run with a 'Wide Load' sign on your back and seldom beat a throw out, you're taking value away. Baseball Reference doesn't provide data about the league average GIDP rate, so calculating this can be rather difficult. If you go to the linked Joey Votto page above and scroll down to the 'Situational Batting' table, you'll see that Baseball Reference does provide the # of GIDP opportunities for each player for each season, but this is using FanGraphs' definition of opportunities (runner on 1st, less than 2 outs). We don't actually get the adjusted opportunities we need to calculate Rbr. FanGraphs doesn't give us much data either.

That's it for fielding; stealing bases, advancing bases, and avoiding grounding into double plays. The fielding metrics all make sense to me and closely match the theory and logic behind the batting metrics, but the lack of available data to the public makes it frustrating. If we were provided the data for all of a player's advancements/GIDPs and the league averages, along with a workbook or post outlining the proof of the run value for these events, I would be more pleased and convinced.

Fielding Runs
Fielding Runs is absolutely the main topic of contention for WAR (for me, at least). It seeks to measure a player's defensive value (in terms of runs) whilst playing in the field. FanGraphs uses a metric called Ultimate Zone Rating (UZR) for non-catchers, which you can read about in the primer here or in the base article here. It is also developed by Mitchel Lichtman (a co-author of The Book) and employs video tracking data from Baseball Info Solutions. UZR is similar to our other metrics in that it does weight a player's fielding events by their run value. We know the value of an out, as well as the value of failing to make an out (an error) or in allowing a hit to occur. However, UZR differs (in a way that I disagree with) by also weighing plays based on how 'difficult' it was to make the out. You may like this idea, but we need to be consistent with how we deal with 'difficulty' on the batting side. A lollipop from a position player on the mound is likely easier to hit than a low and away changeup from Pedro Martinez, but if both pitches resulted in a HR then they'd both be treated the same according to the rules of wOBA and thus wRAA and Batting Runs. Measuring such 'difficulty' would also be rather difficult in itself and open to a lot of subjectivity and interpretation. I get that diving catches are more difficult to make than flyouts right at you, but we don't make any such difficulty adjustment on the offensive side for a pitch's location/speed/spin rate, and being theoretically consistent throughout the calculation of WAR is important. UZR uses video scouts to go back and review game footage of plays, determining things like where balls were hit, the angle at which they were hit, and how hard they were hit. They then use that data and feed it into an engine to essentially determine how often a player across the league would make that play. A more difficult play is presumably one where it is less likely that an average fielder would have made the play.

With UZR, each fielder will either make the play and thus have a 100% probability of making the play, or fail to make the play and thus have a 0% probability of making the play. That is then compared to the probability that an average fielder would have made the play. Then, the difference is multiplied by the increase/decrease in run value of the play. In the UZR article linked above, FanGraphs uses an example of a fielder recording an out that only 25% of fielders would have made (so the average fielder has a 25% probability of making the play), which means our fielder is 75% above average (since he did in fact make the play). Then, FanGraphs has determined (through linear weights) that the average outfield hit is worth about .56 runs and the average outfield out is worth about -.27 runs (for the batter). We aren't shown the exact work of why or how an average outfield hit is worth .56 runs and an outfield out is worth -.27 runs, but from the linear weights we've discussed previously, Tango and wOBA had a non-strikeout out as worth -.3 runs and non-HR hits ranging from .474 runs to 1.063 runs, so these weights make somewhat sense. So the value of recording an out instead of a hit is -.27 - .56 = -.83 runs for the batter, or +.83 runs for the fielder. We would then multiply by .75 to get a total run value of .6225 for that play. Note that the 25% is the probability that any fielder would have made the out, not just a specific position. This is done so that players don't get docked if another fielder made the out. For this example, the probability that the average center fielder catches the ball was 15% and the probability for the average left fielder was 10%. If instead both fielders failed to make the out and a hit was recorded, then each fielder does get docked. The difference is now -.83 runs for the fielder, which gets multiplied by .15 for the CF to get a run value of -.1245 and multiplied by .1 for the LF to get a run value of -.083.

UZR classifies batted balls in 4 ways: bunt ground balls, non-bunt ground balls, outfield line drives, and outfield fly balls. UZR classifies the speed of each batted ball in 3 ways: slow/soft, medium, and fast/hard. Yes, infield line drives are ignored due to Lichtman believing they are more 'luck' than skill. Likewise, infield pop flies are ignored because most are caught and because of ball hogging issues (i.e. you making a difficult play that would have been easier for a teammate to make isn't impressive), as well as the belief that when such balls are dropped it is because of miscommunication or a fluke rather than a testament to the player's skill. I disagree with excluding both of these batted ball types. I'll have it on record that I dropped far fewer infield pop flies than my teammates back in my playing days, so I believe that to be a skill, and I'd encourage anyone that thinks catching a line drive is luck to go out there and try to catch a ball that came 100 mph off the bat.

UZR also considers the handedness and speed of the batter for considering if an out would otherwise have been a single, double, triple, etc. Failing to catch a ball could mean a single for Pujols but a triple for Ichiro. Some final adjustments are done based on the characteristics of the ballpark and the ground ball and fly ball tendencies of the pitcher. Because of the ultra specificness of all of these scenarios, there isn't really some singular UZR equation we have to use. A player's UZR is just the sum of the relative-to-average run values of all of his defensive plays.

UZR technically is split into 4 different parts, so you can think of it as an equation in that way if it helps.
To that end, we can write UZR as:
UZR = ARM + DPR + RngR + ErrR
The RngR is Range Runs above average and is essentially what we've discussed thus far. The ErrR is Error Runs above average and works very similarly, but assumes that the average fielder has a probability of 100% of making the play, so the run-value is purely the difference between a hit and an out. This is exactly how I think the errors should be measured. The DPR is Double Play Runs above average and accounts for a fielder's ability to turn double plays, simply measured as the # of double plays actually turned divided by the number of double play opportunities, and then compared to league average. It also considers the speed and location of the ground ball in question. ARM is Outfield Arm above average and accounts for an outfielder's throwing ability. It considers how frequently runners advance, stay put, or try to advance depending on the location and speed of the batted ball, as well as the ballpark in question.

I appreciate UZR in trying to think beyond merely fielding percentage as a defensive metric, but I think it deviates far too greatly from the other aspects of WAR and makes it even more complicated and less tangible and able to be recalculated by the general public. Just like UBR, FanGraphs will give us the values of UZR for each player, so clearly the data is somewhere, but they don't give us any more details. Here's another article on FanGraphs that dives into how they measure defense. I feel that a fielding approach more similar to wOBA would still be effective and superior to fielding percentage, while also not being as complicated and open to the results of an engine. We know the run value of an out (and even a GIDP or a non-SO out), so we could easily use those values and apply them to the outs (putouts and assists) that a player actually makes, and weight based on their defensive chances (putouts + assists + errors) or innings played. Then we can factor in errors the same way that UZR does. This would essentially be the same as calculating wOBA, but only if we used all hits rather than specific hit types. Given the data that I have, this is essentially what I plan to do for the player value metric that I am working on. To make things better, if we had enough readily available data we could determine the average run value of different types of batted-ball outs, from ground balls, to fly outs, to line outs, to pop flies, etc. Then we could tally up all the different types of outs that a player makes and weight them based on the run value of each type of out. UZR more or less does this, but applies too many adjustments and makes things too complicated and doesn't show us the work.

**Update 7/13/22**: Upon publishing this post, it came to my attention that prior to the start of the 2022 season, FanGraphs changed the range component (RngR) they use for Fielding Runs to be Fielding Runs Prevented, which is the Statcast/Baseball Savant Outs Above Average (OAA) converted to runs. This change is retroactively effective for all players from 2016 and on. Given the depth of this post as-is and the detail of these new pieces, I will simply link most of the references here and only say a little about them. You can read about FanGraphs' change here. FanGraphs discusses Fielding Runs Prevented and OAA here. The MLB Glossary defines OAA here. You can view the Statcast/Baseball Savant OAA leaderboard here, which also offers a short description of the metric. Tom Tango has a blog post where he discusses the outs-to-runs conversion a little here. Mike Petriello has an article explaining the expansion of OAA to include infielders here. He mentions a very comprehensive piece of writing from Tom Tango on fielding in that article, which you can find here. This post here by Tom Tango on the MLB Technology blog also discusses OAA, but it does cover a lot of the same info as the previous link. Essentially, OAA works a lot like the previous RngR metric, but to a superior degree. OAA is measured differently for infielders and outfielders. The baseline for outfielders is Catch Probability, which you can read about in the MLB Glossary here or in another article by Mike Petriello here. Statcast/Baseball Savant also has a page for Catch Probability here. As you could have guessed, Catch Probability is the likelihood that an outfielder will catch a given batted ball. This likelihood is determined by measuring 4 things using Statcast: the distance travelled by the fielder, how long he had to get there, the direction he had to move in, and whether he was close to the wall. Catching a ball right at you is easier than one 50 feet away, high fly balls give you more time to run 50 feet than screaming line drives, running 50 feet in to catch a fly ball is easier than running 50 feet backwards, and catching a ball whilst running into the wall is more difficult than not having to do so. All of this is superior to RngR because Statcast gives us the actual measured data of these events, leaving the subjectivity of a video scout out of the question. It simply uses the distance needed (optimal route to the ball) to catch the ball (not the distance covered, or actual route taken), along with the opportunity time to reach that distance (the time from when the ball leaves the pitcher's hand to when it lands/would have landed). Then difficulty adjustments are made for direction and wall proximity. Given these measurements, an expected Catch Probability is assigned in increments of 5%, so for instance no play has an expected catch probability of 27%. This tells us the probability that an average fielder would have made the play. Players get credited and docked for each play they make or fail to make. If you make a play with an expected Catch Probability of 75%, you get 1-.75 = +.25 credit, and if you fail to make a play with an expected Catch Probability of 25%, you get 0-.25 = -.25 docked. The sum of all of these gets added for each player throughout the season to get their OAA. For infielders, Catch Probability isn't considered, but the following factors and measurements are taken into account: distance needed to reach the ball (intercept point), time to get there, distance from base where out will be made, and average speed of the runner (for force plays). Statcast has measurements of Sprint Speed for every player; you can view the leaderboard here or read about it more here. We can measure how fast runners are, and obviously it's more difficult to get fast runners out than slow ones. Based on the different factors on the infielder side, an out probability is determined, which works basically the same way as Catch Probability. In Tango's MLB Tech blog linked above, you'll want to scroll down to the 'Probability Distributions' section to get a look into this. The sum of a player's differences from each out probability gives them their OAA for the year. The OAA to Fielding Runs Prevented (which is also called RAA or Runs Above Average) adjustment is based on the player's position. Looking back at the OAA leaderboard, you'll notice that generally players that play the same position and have the same # of OAA will have the same # of Runs Prevented; differences will be if one of the players plays multiple positions. However, players that play different positions but have the same # of OAA will generally have a different # of Runs Prevented. In his blog linked above, Tango quantifies this conversion as each out being worth .9 runs for outfielders and .75 runs for infielders. This more or less checks out with what we see in the OAA leaderboard, since values are rounded. Again, any differences are likely due to players playing multiple positions. So with the use of RAA instead of RngR, the Fielding Runs equation for outfielders and infielders from 2016 and on now becomes:
Fielding Runs = ARM + DPR + RAA + ErrR
Rather than just solely using UZR for Fielding Runs. This is an improvement. Any use of Statcast is an improvement for measuring modern player performance, because we actually have the technological capability to measure these events rather than infer them. This is similar to how using Park Factors based on Statcast data, where we can see if a ball would have been a HR in every park based on the park's dimensions and the ball's traveled distance, exit velocity, and launch angle, is superior to using calculated Park Factors based on certain parks just having a certain % of more homers. Using metrics based on measured, recordable data is always better. To this end, this update to Fielding Runs makes WAR an even better metric for comparing modern players. However, it is important to note that obviously this data wasn't available for Babe Ruth, so using WAR to compare 2 players when it's calculated differently for them is still an issue. Furthermore, OAA continues to adjust specific plays by difficulty, which we don't do on the batting side, leaving to inconsistency. **End of Update**.

For catchers, FanGraphs does not use UZR to measure Fielding Runs but rather uses Stolen Base Runs (rSB), and Runs Saved on Passed Pitches (RPP). By the names, you can probably guess what each seeks to measure. You can read up on FanGraphs' approach to catcher defense here. More specifically, you can about Defensive Runs Saved (DRS) on FanGraphs here; rSB is simply one component of DRS, which is an altogether separate defensive metric that is preferred by Baseball Reference. DRS is calculated by The Fielding Bible and John Dewan; you can find the book here. The FanGraph's DRS page doesn't really dive into much actual calculation, but they do reference The Fielding Bible website for a little more insight. As the calculations presumably get more complicated, most sources prefer to just provide mere descriptions of what they're doing rather than show the actual work and equations behind them. For a math guy like me, I find this infuriating. While I don't believe these people are just pulling numbers out of thin air, with the lack of proof of work and explanation of calculations, they probably could just make numbers up and easily get away with it. (Again, I don't think these people are making things up). As I mentioned earlier, there isn't much accountability among the baseball audience and most people don't really try to dig deeper into understanding these complex metrics.

Technically, both pitchers and catchers contribute to rSB. Pitchers can curtail steals by holding runners on effectively to ensure they don't get larger leads, as well as by throwing faster or just having a quicker delivery when runners are trying to steal. Catchers can't do much to curtail steals besides telling their pitcher to throw over or perhaps signal a pitch out, but they can actually throw the runner out. This of course is dependent on the catcher's pop time (how long it takes for them to catch the ball and get the ball into the fielder's glove), as well as the accuracy of their throw. I assume this metric works very similarly to wSB, but instead of rewarding the runner for a SB and docking them for a CS, the catcher gets rewarded for a CS and docked for a SB.

The other piece for FanGraphs' Fielding Runs is RPP, which you can read up on here. This is meant to measure a catcher's blocking ability, and uses pitch tracking data to analyze the difficulty of receiving specific pitches. They essentially don't trust official scorekeepers in deciding who is to blame (the pitcher or the catcher) when deciding whether a ball that gets by the catcher is a wild pitch (WP) or a passed ball (PB). By definition, WP is the pitcher's fault, and a PB is the catcher's fault, and both must involve a runner advancing a base. These also only measure failures, so we can't see the # of successful blocks a catcher has, but rather just how often he fails to block. The link above has a visual for the probability that a given pitch gets by a catcher, depending on its location. I think the visual is pretty cool and helps to understand RPP, so I'll go ahead and include it here. Credit to FanGraphs, The Hardball Times, and Bojan Koprivica.

As we can see, pitches right down the middle have a near zero probability, while pitches that are outside and either well above the strike zone or well short of home plate and in the dirt have a probability of about 30%. Essentially it will use these probabilities in a similar way as the fielding probabilities for UZR. If you actually blocked a pitch that only 70% of catchers blocked (probability of an average catcher blocking the pitch is 70%), then you have an increase in probability of 100% - 70% = 30%. This is then multiplied by the run difference of a successful and unsuccessful block. Tom Tango's earliest linked post has the average run values for each event type, but I'll link that again here. We can see that the WP has a value of .285 and the PB has a value of .284. For the purposes of RPP, all WP and PB are lumped together and referred to as Passed Pitches (PP) and given a run value of .28. So if you let a ball right down the middle get by you, you'll be docked basically 1*.28 = .28 runs, but in the earlier case where you caught a ball on one of the extreme corners, you'll earn .3*.28 = .084 runs. Overall I like this approach, but feel that there should be some type of border where the pitcher is to blame. Some pitches simply aren't blockable, and under this system catchers that fail to block the most extreme of pitches still get docked .7*.28 = .196 runs each time, since 70% of catchers block pitchers in the extreme corners, which encompass all zones further outside of them. You get penalized the same amount for failing to block a ball just in front of home plate as you would if the pitcher literally spiked the ball into the ground or threw it into the stands. I also think more work has to be shown as to how these probabilities were derived.

So FanGraphs measures catchers' throwing guys out and blocking pitches well, but doesn't really give us all the data we'd want to properly follow along with their final numbers. However, they don't measure any other type of catcher fielding (bunts, pop outs, tagging guys out at the plate, etc), which is... odd. They are striving to measure the framing skill of a catcher (ability to dupe the umpire into calling an actual ball as a strike), but haven't yet gotten there. If robo-umps get implemented, this won't really be a skill anymore.

**Update 8/22/22**: The statement above where I said that FanGraphs had not yet incorprate cathcer framing into WAR as of July 2022 was incorrect. FanGraphs actually added framing to their WAR in March of 2019, which you can read about here. They just hadn't updated their articles that explain WAR. You can read further into how they calculate catcher framing here. They created models that predict the probability of a pitch being called a strike based on its count and location, versus both right handed and left handed batters. They then credit catchers for the additional strikes that they get called in excess of the amount that would be predicted. Each additional strike is said to be worth about .135 runs. The total of these are said to be the catcher's Framing Runs. On the catcher side, these just get added to their total runs, which are used to convert to wins to eventually get their WAR. On the pitcher side, their catchers' Framing Runs per 9 innings are added to their FIP when computing pitcher WAR. **End of Update**

Baseball Reference entirely relies on DRS for its Fielding Runs (which it calls Rdef.) for all players 2003 and on. For players before then, Baseball Reference uses Total Zone Rating (TZR). This is problematic to me because while DRS may be more accurate and applicable to compare current players with, it is tricky to compare the WAR of two players from different eras when we are measuring their defensive skills in different ways. Baseball Reference doesn't show how DRS is truly calculated, but does mention the 8 factors that are considered. It's really not all too dissimilar from UZR; for instance, the first factor is Fielding Range Plus or Minus Runs Saved, which is based on video tracking data (batted ball location and speed) provided by Baseball Info Solutions. Then there's an outfield arm component, also based on the speed of the batted ball and the number of guys thrown out versus not thrown out. There's also an infield double play component based on the # of double plays turned compared to the # of double play opportunities, while considering the speed of the batted ball. For catchers, it considers their bunt fielding and their ability to throw runners out, while considering the role pitchers have in preventing steals as well. There's also a more subjective-sounding 'catcher handling of the pitching staff', which is based on things like the pitches they call and their framing ability. Lastly, there are 'good play' values for 28 positive play types (such as robbing a HR or blocking a pitch in the dirt) and 'bad play' values for 54 negative play types (such as missing the cutoff man or pulling your foot off the bag). It all sounds pretty comprehensive and grand, but we're not shown how it actually works in action, and again we don't have all this data for older players so it's ignorant to use it when comparing them.

TZR also suffers from this data comparison flaw, as it relies on as much data as is available for each season. Here's an article from Baseball Reference that talks a little about the TZR system. The 'total zone' idea is basically the percentage of balls that are hit to the fielder that are turned into outs. A lot of data is unknown for the actual hits, such as exactly how many balls were hit toward the third basemen that were recorded as hits. TZR uses 3 different methods to approximate this depending on the year and the data available. One 'method' basically has the data that already tells you who fielded each ball and where it went by (which fielder is to 'blame'). For example, I know my LF fielded a grounder to the outfield that went by the shortstop. Another method knows who fielded the ball, but we don't know who quite to 'blame'. For instance, for a ground ball single to left, was it the third basemen or the shortstop that had the opportunity to make a play? Since this information is unknown, the responsibility is split between the two. The last method is used when we don't even know who fielded the hit. We can look high level and determine that say 30% of all outs are to the shortstop, and then assume that 30% of all hits must be towards the shortstop as well. I understand and more or less agree with the methods here to determine roughly how many hits to 'blame' each position for, but I disagree more with the blaming to begin with. A lot of hits are just not possible to turn into outs, and I don't think fielders should be docked for 'failing' to do so. Rather, I feel that hitters should be rewarded for the outs that they do make, and then docked for failing to make outs that we'd expect them to make (errors). As for exceptional plays, those should be present in our fielders by simply recording more outs; if you made a diving catch and someone else didn't, you'd have more outs.

Apart from the Total Zone Runs/fielding range part of TZR, there are also the standard pieces of outfield arms, double plays, and catcher data. These work very similarly as previously discussed. Each of these pieces are added to get the final TZR.

That is it for fielding! Since Baseball Reference does consider a catcher's basic fielding abilities in addition to his throwing and blocking, I favor its calculation of Fielding Runs over FanGraphs'. As you may have noticed, measuring fielding is much more complicated and has much less straightforward equations for us to follow along with. It still uses the idea of the run value of events, but adds in my opinion way too much detail that isn't matched on the batting side. The manner in which Fielding Runs are currently measured by either party makes WAR far more complicated and also makes it even more troublesome when relying on WAR to compare players of different eras. I support the evolution of WAR and the use of it in the present to try and best measure player value, but please do not rely on it to compare players from different eras. What we need is a more simple calculation that is still effective and more consistent across time.

Positional Adjustment
The idea of a positional adjustment shouldn't come as a shocker to anyone; clearly, some positions record more outs than others and some positions hit better than others. My proposal to scaling this would be to always compare a player to his position's league average, rather than the league-wide average across all positions. However, FanGraphs and Baseball Reference do something else for WAR. FanGraphs actually words its positional adjustment a little interestingly. It acknowledges that Fielding Runs are already scaled to the position's average, but that some positions are just harder to play than others. This means that FanGraphs believes that an above average shortstop is worth more than an above average first basemen, since it is easier to defensively play first base. There is no mention by FanGraphs for adjusting by position for offensive purposes, so presumably an above average hitting second basemen doesn't mean much compared to an average outfielder. They talk a little more about why they don't use an offensive adjustment here, as well as go into more detail about the adjustment they use. They also reference some analysis done by Tom Tango on this matter, which you can view here. He essentially compares the fielding ability (measured by UZR) of players that play multiple positions, and how it varies at one position versus another. If a guy that plays LF and CF has a higher UZR when he plays LF than when he plays CF, then we assume it's easier to play LF than CF. However, FanGraphs' values by position below vary somewhat notably from what Tango produced, and they fail to show their work as to why that is the case. FanGraphs applies the adjustment by just tacking on or removing a certain # of runs depending on the player's position. Here are the runs that are added/subtracted for each position, from FanGraphs:

Since not all players solely play a single position the entire year, FanGraphs adjusts this run addition/deduction based on the proportion of innings that the player plays at each position. For each position played, you'll take the # of innings you played at that position and then divide by the total number of innings you could have played at that position (every inning of every game, or 9 innings per game for 162 games, which is 1,458 innings). That gives us the % of innings that you spent playing the position. We then multiply by the positional run value per 162 defensive games, to get the equation below.

Note that as mentioned above, a 'defensive game' is defined as a full 9 innings. The position specific run values listed above are assuming you played an entire full season at one position, so if that's not the case we must adjust your positional adjustment by the amount of time you were actually playing that position. Lastly, you would just sum the positional adjustments up for all positions played that year to get your final positional adjustment.

Baseball Reference refers to its positional adjustment as Rpos and handles it a little differently. Their adjustment values are different, but they apply it in nearly the same way. Instead of dividing by 9*162 (1,458 total innings) for each position, they divide by 9*150 (1,350 total innings). This may make more sense because it's more likely that a player will play 150 entire games at a position than he would play literally an entire full season at one position. To this extent, Baseball Reference's position specific run values are per 1,350 innings played, rather than per 162 defensive games played as is the case with FanGraphs. Unlike FanGraphs, Baseball Reference does consider the different positions' offensive value in addition to their defensive value. Here's the table they provide supporting the notion that some positions are more offensively inclined than others:

All of the numbers above are assuming 650 plate appearances. Acknowledging these differences and quantifying them (presumably in a wOBA-like way), along with the changes in fielding performance when players change positions, Baseball Reference arrives at the following run value adjustments for each position, per 1,350 innings played:

In general, Baseball Reference thinks that corner outfielders, designated hitters, first basemen and third basemen are a little better than FanGraphs does. This isn't surprising given that these are the better hitting positions. Again, the final positional adjustment works essentially the same way; these position-specific run values are just different, and we divide by 1,350 instead of by 9*162. There is a slight caveat with Baseball Reference in that they ensure that the league's total positional adjustment sums to 0. When this isn't the case, they assign some more runs to players based on their playing time.

League Adjustment
This adjustment follows the notion that the American League and National League are not equal each year. As I mentioned previously when a similar adjustment is applied to wRAA, I don't support such a league adjustment. The goal is to have each league's (AL or NL) run value above average sum to 0. Up to this point (Batting Runs + Baserunning Runs + Fielding Runs + Positional Adjustment), if you did this for both leagues that may not be the case. The league adjustment tells us how many additional runs per plate appearance we need to add to the league total to force the league's run value above average to be 0. We then multiply that additional required R/PA amount by each player's # of PAs, for each player in the respective league. The equation looks like this:

All of the lg values are the respective league's average values for each of the WAR components we've discussed thus far, as well as plate appearances. There is a negative because players that play in leagues that need R/PA added to get them to 0 will be docked, and players that play in leagues that need R/PA taken away to get them to 0 will be credited. You get kudos for playing in a difficult league and you get docked for playing in an easier league. We then multiply by the player's PA in each league, so this works for guys that get traded across leagues during the season such as Mark McGwire in 1998 as well.

Baseball Reference does adjust for league, but encompasses it into its Replacement Level calculations, which we will discuss next. Hey, this WAR component was pretty easy, albeit I don't agree with its existence.

Replacement Runs
Up until now, all the work has been determining the relative # of runs a player is worth above average, and then applying some adjustments. However, WAR of course stands for wins above replacement. So how do we go from above average to above replacement, and better yet, why? FanGraphs essentially lists 2 reasons as for 'why'.

First, they state that being average has value. I agree! Is that necessarily a reason why we can't compare to average? I disagree. Society and baseball fans are smart enough to realize that a player worth 0 runs is better than a player worth -20 runs. We don't need everything scaled so that any sort of value is always positive. And while being average is fine and does have value, certainly being above average is preferable. To that end, a team can realize that they have an average player, which again is fine, but that there is room for improvement. They also can realize that they have a player that is below average and worth switching out; we don't need to compare to whatever 'replacement' is to determine these things.

Second, they state that comparing to average doesn't allow us to differentiate between players with few plate appearances and many plate appearances. Hmm. Let's just focus on wRAA. If my wOBA is .300 and the league average wOBA is .300, then regardless of my # of PAs, my wRAA will be 0. Funny thing is, plate appearances are actually a readily available, recorded statistical event in baseball. Given two guys with a wRAA of 0, we can quickly look at their respective # of PAs to get context into if one player played vastly more than the other. The idea behind 'replacement level' is that we could set the baseline wOBA to instead be something abysmal like .150. Then if one player had 500 PAs and the other had just 10 PAs, and assuming a wOBA Scale of 1 for simplicity, and both players still had a wOBA of .300, the first player's wRAA would be (.300 - .150)*500 = 75, but the second player's wRAA would be (.300-.150)*10 = 1.5. We are recognizing that on a rate basis (aka according to wOBA), both players have been equally good, but that the first player has overall provided more value since he performed at that level for a longer period of time. I can understand the mathematical appeal to this, but again we can easily look at a player's PAs to understand how valuable his being average actually was.

Moving forward, what even is replacement level? It is defined as the quality level of a 'freely available' player, meaning someone that an MLB team could call up or procure on a whim. That's not you or me, but rather a bad MLB bench player or a minor leaguer. But replacement level isn't just a description, it actually is a defined amount. FanGraphs writes about their rationale of replacement level here. In the article, they define their replacement level quantitatively as a .297 winning percentage, which across a 162-game season would be equal to about 48 games. Why do they use a .297 winning percentage? They don't tell you. Granted, FanGraphs doesn't tell us a lot about their beloved WAR. Their article that is intended to explain WAR fails to adequately do so, leading you on a wild goose chase of other links across the internet. Even their more 'thorough' pages fail to show their work for calculating different components. When they do mention some numbers, I generally had to go review the work of Tom Tango, who is smart enough to recognize that readers actually like to see why certain values are what you claim them to be (granted, Tango could still show more work and use some website design guidance). Despite many Google searches, I just couldn't find why this replacement level was set to what it was. I found that FanGraphs and Baseball Reference used to employ noticeably different replacement levels of .265 and .320, leading to starkly different WARs for some players. The two sides met together and agreed on a universal replacement level to help quiet the attacks on WAR and win people over. Cool. Why .297? Seemingly because it was about the midpoint of what the 2 sides were at before, but there surely have to be other reasons. I found an article on Baseball Prospectus (who again has their own metric, WARP) that attacked Boston journalist Bob Ryan on criticizing this very thing; they still failed to show their work as for the why and only offered mere descriptions. Baseball Prospectus actually does describe replacement level in a way that makes more sense than how it is actually used by FanGraphs and Baseball Reference. BP seems to suggest that they look at the instances of when backup players at a given position played, and found the average performance of the backups. The idea for using the average is that starters with better backups shouldn't be penalized. Despite this logical description, BP still fails to show any work to back it up, and furthermore this isn't how FanGraphs or Baseball Reference appear to be doing things. How do they do things? Again, they don't tell you.

So desperate was my search for the rationale of the .297 winning percentage that I finally took to asking the question myself on the r/Sabermetrics sub-Reddit. My post got several upvotes before any response came; supposedly this is a forum of people like me who enjoy the statistics of baseball, but many of them aren't even attempting to understand the complexities of WAR before falling in love with it. Fortunately, user BarristanSelfie was able to provide a pretty solid explanation. He stated that the baseline for the .297 winning percentage was the 1962 Mets. The Mets that season went 40-120-1, for a winning percentage of .250. So, we assume a team full of replacement level players would be slightly better than one of the worst teams in baseball history. Other more recent atrocities include the 2003 Detroit Tigers that went 43-119 (.265) and the 2018 Baltimore Orioles that went 47-115 (.290). But none of these teams actually went .297, and neither has any MLB team in history, so why did we decide on this amount? The truth is that the answer was backed-into, like an Excel GoalSeek solution, to get the answer that they wanted to work to indeed work. And given the worst historical records in history, it seemed to all make sense.

There are currently 30 teams in the MLB that each play a 162-game season. Each game involves 2 teams, only one of which can win. This means in total we have (30*162)/2 = 2,430 total games, and thus 2,430 available wins. FanGraphs and Baseball Reference both have a total WAR allotment of 1,000 wins per 2,430 games played. This means that they believe there are 1,000 wins above replacement there for the taking. This implies that there are 2430 - 1000 = 1,430 replacement-level wins that will be taken as a default. Divided across 30 teams, that's 47.67 wins per team, which is about a .294 winning percentage across a 162-game season. (The FanGraphs replacement level page mentions the .297 winning percentage, but most other sites and figures seem to suggest an actual .294 winning percentage). But that still doesn't quite explain why they use the replacement level that they do.

For kicks and giggles, let's just *assume* that an average level player would have a WAR of 2. That means the starting lineup (including a DH) of our average team would be worth 18 wins above replacement. Let's say a starting pitcher is worth 200 innings (really only 4 guys did this in 2021, but 61 guys did this in 1976). A team will play 162 nine inning games, for a total of around 1,458 innings pitched. You can check the total innings pitched of our crappy teams linked above; the 2018 Orioles had 1,431 innings pitched. If we assign 2 WAR for every 200 IP (i.e. for each starter, and the rest to the combined amount for relievers), we see that we get 1458/200 = 7.29 * 2 = 14.58 WAR. Combining this with our position player starters, we get a total of 14.58 + 18 = 32.58 wins above replacement. Remember that this is an average team. The theoretical assumption of WAR is that an average team would go .500, and thus win 81 games in a 162-game season. So if our average team is worth 81 games, and 32.58 of those wins are in excess of replacement, then we could expect the replacement level team to win 81-32.58 = 48.42 games. That would imply a winning percentage of .299, but would also imply that there are 30*32.58 = 977.4 wins above replacement available. The makers of WAR they prefer the nice round 1,000 wins above replacement available to distribute amongst all players, so they round up to that amount, which reduces the replacement level winning percentage to .294. So we don't really have a good defined way for why we use .294; we just made an assumption, see where it got us, adjusted to a nice round number, and then deemed it satisfactory since it's roughly the winning percentage of the worst teams in MLB history.

Now that we've covered the 1,000 wins above replacement that are available for all players, we must separate them between position players and pitchers. FanGraphs allots 570 wins (57%) to position players and 430 wins (43%) to pitchers. This is presumably because most teams spend 57% of their available funds on position players, and the remainder on pitchers. But FanGraphs doesn't provide any data to support this, nor do they actually state that this is the reason. Rather, Baseball Reference uses similar splits of 59% and 41%, and uses this explanation of salaries of position players vs pitchers.

So given that we have 570 wins above replacement for position players out of the 2,430 wins available, we can finally calculate replacement level runs for position players as follows:

MLB Games are the total number of games played by all teams in the MLB thus far in the season. This allows us to calculate the Replacement Runs during the season. This is because not all 570 position player wins above replacement will have been allotted until the season is completely over. If not all wins have took place, then not all wins above replacement could have taken place either. If doing this after a full season, then the fraction would just become one since the MLB Games would equal 2,430.
Runs Per Win will come into play in the denominator of the overall base WAR equation, but is essentially about how many runs you need to win a game, i.e. how many runs each win is worth. lgPA is the league average number of plate appearances. Then of course we have the PAs for our player in question.
So big picture, this equation is seeing how far we are into the season to see the % of wins above replacement that we currently have available to distribute; then we multiply by the number of runs per plate appearance that a league average player is getting, and then by the number of plate appearances the player in question has. This quantifies for us the difference in an average player and a replacement player, given a certain # of plate appearances.

As mentioned previously, Baseball Reference includes its league adjustment within their replacement level calculation. In addition to the division of 590 wins to position players and 410 wins to pitchers, they also divide the wins between the NL and the AL based on their relative quality. For example, in 2019 the NL was given 475 wins and the AL was given 525 runs. In 1950 the NL was given 279 wins and the AL was given 228 wins; the total here doesn't add to 1000 since there were less teams back then, so there were less wins to be had. At the bottom of the Baseball Reference WAR page that I linked at the beginning of this post, they have a table of the win splits by league each season. They also mention a blurb that highlights the iterative nature of determining replacement level: "After we make a first pass through the calculations, we determine how the league's current total WAR differs from the desired overall league WAR. We then add or subtract fractional replacement runs from each player's runs_replacement total based on their playing time, and recompute WAR_rep with this adjustment included".

My last final rant about replacement level vs average as comparative baselines is that the concept of average has existed in the history of mathematics for many, many years. Feel free to read up on the idea of average here. Whether it be the median, mode, arithmetic mean, geometric mean, or even harmonic mean, there are many ways to calculate what is 'average'. To be average is to be typical and indicative of most of the group. There is no such mathematical concept for 'replacement'. It is a purely arbitrary, back-end solution to a mathematical problem. While really no 'average' players really exist, the average is calculated from and indicative of actual data from the group. No 'replacement' players really exist either, but the replacement level is not calculated from and indicative of actual data from the group. It's just a number they derived to suit their needs and checks out with the winning percentage of the worst teams. FanGraphs has an article here discussing some real-life replacement level player examples. So we grabbed 24 players and they each had a WAR around 0 (replacement level); this doesn't necessarily mean that all such players in history would have this WAR, and again this only shows that replacement level is more or less something we defined as what these guys played at, rather than a more dynamic mathematical concept that is representative of a group. If replacement level were defined more so along the lines as the bottom 10% or 25% of the group, then I'd be more convinced. The only advantage of replacement level is that it works better in a particular equation that was developed to solve the 'quality' vs 'quantity' debate of "how do we measure being great in the short term versus being good for a longer period of time?"

Runs Per Win
The final core component of WAR for position players is found in the denominator and seeks to measure how many wins a player is worth based on how many runs he is worth. We divide by Runs Per Win because we seek to convert a player's contributions, as measured by runs, into his contributions as measured by wins. Before I dive deeper into this conversion, I'll list out my 3 main criticisms of this final step.

For one, Runs Per Win isn't an actual conversion. Definitively, there are 60 seconds in a minute. There are 9 innings in a baseball game, and there are 3 outs in each half-inning. This isn't news to us; we know these things. One not as familiar with baseball stats probably couldn't tell you how many runs are in a win. That's because it is NOT a definitive conversion. There is not a set # of runs that a team must reach in order to win the game. Nor is there a mercy rule in the MLB; there is no # of runs that a team can score and automatically win the game. Rather, you simply must score more wins than the other team in order to win. You can score 1 run and win, or you can score 25 runs and win. Each of them equals a win. Proponents of WAR believe that wins are the currency of baseball. I disagree, and would argue that runs are the currency of baseball. Within the context of an individual game, runs are all that matter, not wins. Within the context of a season we may care about how many wins each team has, but they only got those wins because they scored more runs than their opponents in those games. In game 7 of the World Series, all that matters are how many runs each team has, not how many wins each team had in the regular season and postseason up to that point. While we can expect teams that score more runs (and allow less runs) to win more, there is NOT a guarantee that X runs is equal to a win.

Second, a player cannot be equal to a win. It is very possible for a player to score a run all on his own; all he has to do is hit a home run, or in a more extreme fashion he could hit a triple and then steal home. Players may need additional help from teammates to score runs, but it's normally just 1 or 2 other players that assist in helping that run be scored. Players step on home plate and score the runs themselves all the time, every game. It is virtually impossible to attribute an entire win to a single player. To do so would involve an extreme effort even difficult for Shohei Ohtani, whereby he must pitch a perfect game or no hitter where the only types of outs he gets are strikeouts (or balls hit right at him), and then hit a home run without anyone else on his team scoring. Even then, he needs the help of his catcher in getting those outs. This 'solo win' simply doesn't and never will occur. Baseball is a team sport; good players get left on bad teams and miss the playoffs and the World Series all the time. How many times have we seen Mike Trout and Shohei Ohtani both play superbly this season and the Angels still lost? They simply can't win a game for their team by themselves, but boy can they score some runs. A single player can't give his team a win, so it doesn't make sense to believe that a player is worth a certain # of wins. A player can give his team a run, however. He could also save his team of a run, such as by robbing a homer.

Third, the need to convert to wins is unnecessary. We already did all this work to value players based on runs. Just use that as the metric. Teams know what runs are, and we know that more runs is preferred to less runs. The player with more runs is the better player; there's simply no need to then translate into wins and determine that the player with more wins is the better player. It's simply a waste of effort.

Despite my criticism, WAR does in fact convert each player's runs above replacement into wins above replacement. The exact number of Runs Per Win changes each year based on the run environment and is normally between 9 and 10. This figure is based on the average # of runs that a team needs to score per additional win. Put another way, it is the slope of the linear regression line of Runs Scored vs Wins (using wins to predict runs scored, in this case). For a 1 unit increase in wins, about 10 runs are needed. We can also simply interpret this as each seasons total # of runs scored divided by the total # of wins (which will be 2,430 in a 162-game season). That chart over time looks like this:

Note that the Y axis above is Runs divided by games divided by 2, since the dataset used counts each team's win and loss as a game. Thus 1 actual game comes up as 2 games; a game for the winning team and a game for the losing team. Nonetheless, we clearly see that runs per win hovers around 9 to 10 over time for the last 100 years. This seems to refute the notion that yearly adjustments for environment are needed as well. From 1920 onwards, Runs Per Win has a mean of 9.51 and a median of 9.11. Alternatively, using the simple linear regression approach on an individual team basis of predicting wins using runs, we get the following plot:

The equation for the regression line is Y = 0.084462x + 18.975103. This means that we expect to win about .08 more games for each additional run that we score. This comes out to needing to score about 11.84 additional runs to win a game. A little higher here, but given the fact that we used every team's data for each season, rather than the league average each season, we can expect to see more variance in our results.

So we need around 10 runs to 'convert' to wins and get the final WAR values we want. FanGraphs uses the following equation for its runs to wins conversion:

The 9/Innings Pitched part essentially makes this Runs Scored per game, and then there are some adjustments done on the end, presumably to translate this into the runs needed to win the game.

You can read about Baseball Reference's approach to Runs Per Win here. It's nothing too different; you're still gonna get something between 9 and 10 runs per win.

Once runs have been converted to wins, WAR is complete for position players! Given the bulk of material thus far, you may want to call it quits or skip to the bottom, but if you're interested in seeing how the WAR calculation is different for pitchers, we will press onward.

But first, a quick summary of position player WAR:

WAR considers Batting Runs, Fielding Runs, Baserunning Runs, a positional adjustment, a league adjustment, Replacement Runs, and then converts runs to wins
Batting Runs uses wRAA which is based off of wOBA, which more properly weights the different offensive events based on their run value. We also use park factors since some parks are easier to hit in than others
Fielding Runs uses DRS/TZR or UZR (*or OAA/RAA) for non-catchers, and different metrics for catchers. For catchers we care more about their ability to block and throw runners out. For other position players, we weight the outs based on their run value and adjust for how difficult it was to make the out
Baserunning Runs used wSB (base stealing ability), UBR (base advancing ability), and wGB (ability to not hit into groundball double plays/beat ground balls out)
The positional adjustment is done because it is easier to be a good defender at some positions compared to others, and some positions are generally better hitters than others
The league adjustment is done because the quality of the AL and the NL may not be equal each season
Replacement Runs is used to convert our 'above average' run values to 'above replacement' run values, where replacement level is the quality of player that a team can quickly procure and allows us to consider quality vs quantity in a formulaic way
The runs to wins conversion is used to measure player value in a tangible way

WAR For Pitchers
Yes, WAR is measured differently for pitchers than it is for position players. In my opinion, Baseball Reference's WAR calculation for pitchers is markedly superior to FanGraphs'. Both sites start with a standalone pitching metric that serves as a replacement for ERA, which they both believe to flawed. I find FanGraphs' baseline pitching metric for pitcher WAR to be highly flawed as a standalone metric.

You can read about FanGraphs' approach to pitcher WAR here. The core component of their pitcher WAR is a metric called Fielding Independent Pitching (FIP). You can read about FIP here. I appreciate FIP in that it is a wOBA-like approach to measuring pitching, but man do I hate the things that it cuts out. The idea of FIP is inherent in the name; there is belief that the runs a pitcher allows to score are not entirely his fault, but also dependent on the fielders out there with him. This makes sense; surely, fielders messing plays up will allow runs to score. Fortunately, there is already a baseline traditional statistic (that I'm sure many of us are familiar with) called Earned Run Average (ERA). You can read about that here if it's a new concept to you. You see, we don't judge a pitcher based on the # of runs that he allows, but rather by the # of earned runs he allows. An earned run is a run that scored not due to an error or a passed ball. If the catcher fails to block a ball he should have and the runner on 3rd scores, the pitcher doesn't get blamed. If the left fielder drops a routine flyout and the runner on 3rd scores, the pitcher doesn't get blamed. If the shortstop lets a ground ball go between the legs and that guy eventually goes on to score, the pitcher doesn't get blamed. ERA already adjusts pitcher performance for obvious fielding miscues. So, why do we need something else?

The main notion for additional fielding refinement is that pitchers with good defenses will benefit in other ways outside of less errors, and conversely pitchers with bad defenses will be hurt in ways outside of more errors. Good defenses will turn otherwise hits into outs, meaning making plays that wouldn't have been errors had they failed to make them. Bad defenses will fail to turn these into outs, meaning the play goes down as a hit and not an error. Furthermore, we can apply our more typical adjustments of league, ballpark, and position (starter vs reliever) to seek to improve upon ERA.

FIP takes any fielding completely out of the equation. It only considers situations where the pitcher has entire control over the outcome (besides the catcher, who still needs to catch pitches that he obviously should). To that end, FIP only considers the events of a home run, a strikeout, a walk, and a hit by pitch. Any ball that enters the field of play and requires to be fielded by a player (including the pitcher!) is ignored. Gee, that's one way to adjust for the quality of the defense behind the pitcher. Here's the equation for FIP:

FIP works just like ERA in that a lower value is better, and thus good events for the pitcher like strikeouts are subtracted, and bad events like a home run are added. You'll notice that the weights used in this equation are interestingly different from the run-value weights we determined for wOBA. Why is that the case? Naturally, FanGraphs fails to explain. John in the comments of the FanGraphs post even asked about this, and was brought to shame for daring to question the values ("Are you really so naive as to believe they just pull these numbers out of their collective ass?"). Well no, but some proof would be ideal. Thanks for daring to seek further answers, John. Five years ago, another brave soul had to take to the r/Sabermetrics sub-Reddit to ask the question of where the weights come from since FanGraphs routinely fails to provide baseline necessary information. Fortunately, Tom Tango himself came to the rescue with a link to a blog post of his explaining the weights.

Tango starts with the actual run values of the relevant events (HR, K, BB/HBP, and BIP for ball in play) per plate appearance. These are about 1.4 for the HR, .32 for the BB and HBP, -.28 for the K, and -.03 for the BIP. He then shifts them up by .12 to get the run values per game. This is because the average pitcher allows .12 runs per plate appearance. For pitchers, PAs are really BFs (batters faced), but you roughly see this by looking at the 2010 Reds pitching stats here. The Reds' pitchers that year allowed 685 runs and faced 6,182 batters, which comes out to about .11 runs per PA, not far from what Tango used. I'm not sure what dataset Tango was working with here, but presumably it came out that the average pitcher allowed .12 runs per batter faced. This shift makes the values now 1.52 for the HR, .44 for the BB and HBP, -.16 for the K, and .09 for the BIP. Then, since FIP doesn't consider balls in play, he shifts the weights back down by .09 runs so that a BIP is worth 0. At the same time, he weights each PA by .09 runs as well (this will become the FIP constant). This makes the values now 1.43 for the HR, .35 for the BB and HBP, -.25 for the K, 0 for the BIP, and .09 for each PA. Tango then multiplies the PA weight by 38.5, stating that there are about 38.5 plate appearances per game. We can look at the 2010 Reds link above and see that they had 6,285 PAs in 162 games, which comes out to about 38.8 PAs per game, so this number from Tango checks out. Multiplying .09*PA by 38.5 runs per PA eliminates the PAs and makes this a constant of 3.465 runs. In the penultimate step, he multiplies each of the weights by 9 since there are 9 innings pitched per game, and the FIP equation uses the run values per inning pitched rather than per game. This makes the weights 12.87 for the HR, 3.15 for the BB and HBP, still 0 for the BIP, -2.25 for the K, while keeping the FIP constant of 3.465. The final step is to convert from runs to earned runs, which Tango does by multiplying each of the values by .923. The 2010 Reds pitchers gave up 648 earned runs to 685 runs, so this value would be .946, but it makes sense that it is higher given that the Reds were an above-average team that year and made the postseason. This final adjustment makes the values 11.88 for the HR, 2.91 for the BB and HBP, -2.08 for the K, and 3.2 for the constant. Rounding up, this would give us 12 for the HR, 3 for the BB and HBP, -2 for the K, and 3 for the FIP constant. As Tango suggests in his post, he thinks that the HR should indeed be 12 rather than 13, and that the constant and use of values per IP instead of per PA is questionable. Nonetheless, this is the closest we get to understanding why the FIP weights are what they are. FanGraphs uses 13 for the HR and doesn't show or tell us why. You'll notice that the values match better before we applied the earned run adjustment, so maybe FanGraphs doesn't employ that step.

FanGraphs lists out the FIP constant values for each season here, the same place where they define their wOBA Scale and weights for each season. The FIP constant can be determined by us though, since FIP is designed so that league average FIP matches league average ERA, much like how league average wOBA matches league average OBP. Here's the equation to get the FIP constant:

So we just take the difference between the league average ERA and the otherwise-would-be league average FIP, and by adding that difference to FIP we ensure that the league average FIP and league average ERA are the same. They put FIP and ERA on the same scale so that people know what a good FIP is. Obviously, learning what makes a good FIP would be way too difficult, so nowadays every stat gets scaled to a scale we're already familiar with (like ERA) or with 100 being average. We learned the scale of what makes a good ERA somehow...

With the formulaic technicalities of FIP out of the way, let's discuss its shortcomings. FIP does do a good job in eliminating the effect of defense on the ability of pitchers to not allow runs. However, it ignores many events that I believe the pitcher is still to blame for. Let's consider 2 (albeit rather extreme) examples to illustrate what's wrong with FIP:

We have 2 pitchers, both of which have thrown a complete game and thus recorded 27 outs. We'll assume it's 2021, so our FIP constant is 3.17. The first pitcher did not strike anybody out, but every out was either an infield pop fly or a weakly hit routine ground ball. He also gave up one home run, walked one batter, and didn't hit anyone. In this situation, the pitcher would have an ERA of 1, but a FIP of (13*1 + 3*1 +3*0 - 2*0)/9 + 3.17 = 4.95. So FIP thinks this pitcher is much worse than ERA does. Do you think a 1 run complete game performance is bad? The second pitcher instead struck everybody out (all 27 outs he got were Ks, wow!). Furthermore, this pitcher didn't give up any homers, and didn't walk or hit anybody. However, we'll say that each inning he gave up a double, followed by a triple, and then a single, so 2 runs score each inning. That means for the full game, he allowed 18 runs to score, giving him an ERA of 18. Terrible. His FIP however would be (13*0 + 3*0 + 3*0 - 2*27)/9 + 3.17 = -2.83. Stellar!

Would you rather have the 1 ERA and 4.95 FIP pitcher, or the 18 ERA but -2.83 FIP pitcher? Hopefully the answer is clear. I think FIP has worth in using along with ERA, showing the implications of defense on ERA. It can provide some context for pitchers' ERA. For example, if two pitchers had the same ERA but one had a lower FIP, we could prefer the pitcher with the lower FIP. However, I believe that using FIP in replace of ERA is absurd. FIP completely discounts pitchers that are able to force weak contact and make batters pop out and hit into ground outs, and unjustifiably rewards pitchers that get absolutely smacked, as long as the hits occur within the field of play. A pop out to first isn't some great play by the first basemen that the pitcher should lose credit for, and a double off the wall isn't some fielding failure that should have the blame moved from the pitcher to the fielder. What's the solution? I think something along the lines of wOBA against the pitcher is honestly the best way to go. We see the run values of events, and we include all of the events.

Now that we've covered FIP, let's move onto how FanGraphs calculates its WAR for pitchers, based around FIP. Fortunately, FanGraphs is smart enough to realize that solely relying on FIP as-is would be a poor approach to measuring pitcher skill, so they apply some adjustments. First, they factor in infield pop-flies by treating them as strikeouts in the FIP equation. This makes sense because getting batters to pop out is certainly a skill of some pitchers, and the resulting run scenario is similar to that of a strikeout; you increased the # of outs, and you didn't advance anyone. Here's an article about why they included infield flies in FIP for WAR. Why don't they just do this with FIP in general? Sigh. This adjustment equation is almost exactly like the FIP one already listed above, we just also subtract by 2*IFFB in the numerator, and our FIP constant is a little different. The constant is different because adjusting the otherwise-would-be FIP will make its difference from the league average ERA slightly different, so we'll have to add a slightly different amount in order for the league average adjusted FIP to match the league average ERA. IFFB is the # of infield fly balls the pitcher had, by the way. FanGraphs refers to this infield pop fly adjusted FIP as ifFIP. Here's what these equations look like:
ifFIP = ((13*HR)+(3*(BB+HBP))-(2*(K+IFFB)))/IP + ifFIP constant
ifFIP Constant = lgERA – (((13*lgHR)+(3*(lgBB+lgHBP))-(2*(lgK+lgIFFB)))/lgIP)

For pitcher WAR, FanGraphs wants to adjust the scale of the now-adjusted FIP to be on the same scale as RA9 rather than ERA. RA9 is Runs Allowed Per 9 Innings Pitched. This may sound advanced and unfamiliar, but it isn't. The MLB glossary defines it here. It is basically the allowed run average, rather than the earned run average. So, we're moving on a less familiar scale and removing the impact of fielder errors... interesting. FanGraphs finds the difference between the league average ERA and the league average RA9, and then adds that difference to our infield-fly-adjusted FIP. That looks like this:
Adjustment = lgRA9 – lgERA
FIPR9 = ifFIP + Adjustment
This gets us what FanGraphs calls FIPR9, which is just FIP but adjusted to include infield pop flies and to be on the same scale as the league average RA9.

FanGraphs then applies a park adjustment to FIPR9. It actually has a distinct park factor designed solely with FIP in mind. Why a different park factor is needed is beyond me, but presumably it only considers a park's effect on the adjusted FIP elements of HRs, Ks, BBs, HBPs, and infield flies, rather than all elements like the other types of hits. The pitcher's home park factor gets divided by 100, and then we divide his FIPR9 by that amount. That looks like this:
pFIPR9 = FIPR9 / (PF/100)
This gives us what FanGraphs pFIPR9, which is just the park adjusted FIPR9. Again, like with wRAA, park factors are applied since some parks are thought to be more conducive to allowing runs to be scored, and vice versa. The thought is that we don't want to penalize pitchers that play in parks like Coors Field where runs are scored more often. Pitchers with higher park factors (hitter-friendly parks) will have their pFIPR9 reduced relative to their FIPR9, and pitchers with lower park factors (pitcher-friendly parks) will have their pFIPR9 increased relative to their FIPR9.

Next, FanGraphs compared each pitcher's pFIPR9 to his league's average pFIPR9. Since it uses either the NL or AL average, a league adjustment is inherent in this calculation. This league adjustment and above average comparison is referred to as the RAAP9, for Runs Above Average Per 9 Innings. That adjustment looks like this:
Runs Above Average Per 9 (RAAP9) = AL or NL FIPR9 – pFIPR9

Up until now, pitcher WAR has been pretty straightforward, albeit flawed since it relies on FIP. Even though we made FIP better by considering infield pop flies, we still ignore other things like ground balls and any other type of hit besides a homer. Now things start to get more complicated with Dynamic Runs Per Win (dPRW). The belief is that different pitchers have different circumstances by which they need a different numbers of runs to win a game. We don't simply takes the RAAP9, compare to 'replacement level' and then divide by 10 or so to get the wins above replacement. Cause that would be too easy. The thought is that a pitcher has a direct influence on their run environment, so we can't use the league average Runs Per Win (batters impact their run environment too, but naturally we aren't consistent and consider this on that end too). FanGraphs uses this equation for its Dynamic Runs Per Win:

What an equation. There are 18 half-innings in an MLB game, and thus 18 recorded pitcher-innings. Our pitcher only pitched in a certain amount of those innings, measured by his innings pitched per game. So we do 18 - IP/G to see how many innings per game our pitcher didn't account for (and thus opponent and other teammate pitchers accounted for). We multiply that amount by the pitcher's league's average FIPR9, be it the AL or the NL. Then we add the portion of the innings that the pitcher did pitch in, multiplied by the pFIPR9. Note that there isn't a league average pFIPR9, since the league average park factor is just 100. So we basically have the left side being the league weighted average adjusted FIP, and the right side being the weighted average adjusted FIP for the pitcher, where the weights are based on the proportion of innings pitched. We divide by the total # of pitcher-innings per game, which again is 18. The left side are the Runs Per Pitcher-Inning attributable to other pitchers, and the right side are the Runs Per Pitcher-Inning attributable to our pitcher. Combining these two sides gives us a total Runs Per Pitching-Inning. When we divide by 18, we go from Runs Per Pitcher-Inning to Runs Per Game. Similar to what we did in the denominator for position player WAR, we add 2 and multiply by 1.5 to go from Runs Per Game to Runs Per Win.

Once you get a pitcher's dRPW, we combine it with their RAAP9 to get their Wins Per Game Above Average (WPGAA). That equation looks like this:
Wins Per Game Above Average (WPGAA) = RAAP9 / dRPW
So a pitcher's wins per game above average are his runs above average divided by his personal runs per win.

This gives us wins above average, but of course for WAR we want wins above replacement, so we must adjust using our replacement level. FanGraphs defines their pitcher replacement level using the below equation:
Replacement Level = 0.03*(1 – GS/G) + 0.12*(GS/G)
This equation accounts for positional differences between relievers and starters. The left side accounts for relievers, and the right side accounts for starters. GS are the # of games you started in (i.e. appeared as a starting pitcher), and G are the # of games you pitched in (i.e. appeared as a starting pitcher or a relief pitcher). So, G - GS are the # of games that you appeared as a relief pitcher in. We look at the % of games that a pitcher appeared in as a reliever vs as a starter. (G-GS)/G is equivalent to (G/G) - (GS/G), which equals 1 - GS/G. So you get .03 for your % of reliever games and .12 for your % of starter games. Basically, if you are solely a relief pitcher, the replacement level is .03, and if you're solely a starting pitcher then your replacement level is .12. However, if you do both, then this equation works to find the correct blend of replacement level. These values are the replacement level wins per game above average. It naturally isn't mentioned why the weights are .12 for starters and .03 for relievers, but I believe it goes back to the same .12 runs per batters faced that Tango mentioned in deriving the FIP weights. Looking back at our 2010 Reds, we see that Mike Leake (SP) allowed 77 runs and faced 604 batters, putting him at .127 runs per batter. Homer Bailey (also SP) allowed 55 runs and faced 465 batters, putting him at .118 runs per batter. The same logic doesn't work for relievers, but maybe the thinking is that since relievers pitch roughly around a quarter of the innings that a starter does, they get about 4 times less.

We can add this replacement level to WPGAA to get WPGAR, or Wins Per Game Above Replacement. That simple equation looks like this:
WPGAR = WPGAA + Replacement Level
This basically gives us WAR per game, so the seemingly final step is to adjust for the # of games that the pitcher played in to get their WAR. Instead of using G (games appeared in), we use a measure for complete games, which would be innings pitched divided by 9. That looks like this:
“WAR” = WPGAR * (IP/9)
FanGraphs calls this "WAR" because they still apply some more adjustments before being finished.

The main adjustment is Leverage and is something I disagree with. This is kind of like how position player Fielding Runs are adjusted based on their 'difficulty'. Leverage is the notion that some pitching appearances are higher leverage and thus more difficult. Relievers go through this scrutiny called 'chaining' where the logic is impended on them that if they go down, they aren't replaced by a AAA player like a starter would be, but rather by the next guy down in the bullpen. The closer wouldn't be replaced by a minor leaguer, but rather by the setup guy. The minor leaguer would still be called up, but would take the bottom spot in the bullpen. FanGraphs uses this Leverage Index Multiplier equation:
LI Multiplier = (1 + gmLI) / 2
almost WAR = "WAR" * LI Multiplier
The gmLI is the average Leverage Index for the pitcher when he enters the game. It varies by pitcher. Leverage is on a scale centered around 1, meaning a situation with a leverage of 1 is neutral. More difficult and higher leverage situations will have a leverage index greater than 1, and lower leverage situations that aren't as difficult will have a leverage index less than 1. The chaining effect essentially brings the player's leverage index closer to 1. If your gmLI was 1.2, then with the LI Multiplier your gmLI would be regressed to (1 + 1.2)/2 = 1.1. Again, this is done since your absence isn't as impactful because there are other bullpen arms that can fill your spot. You can read up on Leverage Index here. Not shockingly, it was created by Tom Tango. The LI Multiplier is then multiplied with our "WAR" to get what is *almost* our final pitcher WAR via FanGraphs. Note that starters don't deal with leverage, so their "WAR" is equal to their almost WAR. Also note that the 'difficult' measure of leverage depends on the inning, the # of outs, the # of runners on base, and the score of the game. I guess a better word to describe it would be importance rather than difficulty, since there's no type of adjustment if a reliever is facing Barry Bonds vs if he's facing Jim Abbott.

The final adjustment is assuring that the sum of all pitchers' WAR is equal to the 430 wins above replacement that FanGraphs has allotted to pitchers. Since the sums normally don't match up, a final adjustment is done across the board to all pitchers based on their innings pitched. We take the total WAR at that point and subtract it from 430 and then divide it by the total # of innings pitched to get WARIP, or WAR per inning pitched. This is then multiplied by each pitcher's specific # of innings that they pitched, as shown below:
Correction = WARIP * IP
WAR = almost WAR + Correction
This correction is added to the WAR we had so far to get the final WAR for pitchers. Note that the correction is generally negative.

That is it for FanGraphs WAR for pitchers. Fortunately, Baseball Reference's pitcher WAR is more in line with their other WAR and has fewer unique elements. They offer good descriptions of what they do, but don't share as many equations or work. You can read about Baseball Reference's details for pitcher WAR here.

They start with the pitcher's actual runs allowed (not earned runs) and their innings pitched. They then see how an average pitcher would have fared if they had pitched that # of innings. This is done using xRA, or Expected Runs Allowed. This is Baseball Reference's baseline pitcher metric for its pitcher WAR, like FIP is for FanGraphs. Pitchers on different teams and in different seasons will face varying quality in terms of the quality of the opposition they face. For each team since 1918, Baseball Reference knows that team's average runs per out, which they can then adjust using park factors. This lets us see the # of runs we would expect the average pitcher to allow, given the set of teams and parks that our pitcher in question faced. This process overall benefits pitchers that have to face great hitters more frequently (such as the '27 Yankees), and docks pitchers that face worse hitters.

One technicality in xRA is that they only include non-interleague games (i.e. only NL vs NL or AL vs AL matchups) or home interleague games. Basically, they think that AL teams will have skewed results for the few games they play each season without a DH. xRA logically gets weighted based on the pitcher's innings pitched against each team in question.

Thankfully, Baseball Reference does not use FIP to account for the quality of a pitcher's defense behind him. Instead, they start by finding the total DRS of his fielders, or the total TZR of his fielders if before 2003. They then divide the # of balls in play (BIP) 'allowed' by the pitcher by the # of balls in play 'allowed' by the defense. This proportion gets multiplied by the team's total defensive runs saved (or total zone rating). This is called xRA_def for the expected runs allowed given a certain defense and that equation looks like this:
xRA_def = (BIP_pitcher)/(BIP_team) * TeamDefensiveRunsSaved
Basically this equation looks at the % of a team's balls in play that were allowed by that pitcher, and attributes that same % of the team's total defensive runs saved to the pitcher. If 10% of all balls in play took place while you were pitching, and overall the team's defense saved say 20 runs, then we'd say that the team's defense saved 20*.1 = 2 of your runs. This amount will be used at the end to adjust the xRA. So thus far we know how many runs we'd expect an average pitcher to allow, given the teams/batters that you've faced and the quality of your fielders.

Baseball Reference uses a positional adjustment called xRA_sprp to adjust for the differences between starters and relievers. This is done because relievers generally have lower ERAs to starters, largely because they only have to face batters once and are able to exert more effort in the short-term rather than having to worry about longevity during the game. Baseball Reference uses an adjustment of .1125 runs per game from 1974 to present, an adjustment of .0583 runs per game from 1960 to 1973, and no adjustment prior to 1960. The adjustment varies over time due to differences in how relievers were used across baseball history. Here's an article from FiveThirtyEight showing the difference in ERA over time. We see that it used to be near 0 prior to 1970, so the smaller amounts back then check out, but for most of the time it hovers around a .3 to .4 difference, so I'm not sure why only a .1 adjustment was used.

The final adjustment piece for xRA is PPFp, which are custom park factors for pitchers. Rather than using the team's park factors, it goes even more specific and adjusts for the parks that the pitcher actually pitched in. Maybe a team plays at Coors Field often, but the pitcher never starts when they do and instead he normally starts at Oracle Park. He'll have a smaller park factor than his team would. Most of the time though, the pitcher's custom park factor will be very close to his team's park factor.

Baseball Reference combines these different adjustments to get a player's final xRA, using the equation below:
xRA_final = PPFp * (xRA - xRA_def + xRA_sprp)

Once we have the final xRA, then like usual we must convert our runs to wins. Baseball Reference discusses that here. It's a little more complicated and they don't give us all that much data to go off of, but again it's more or less equating about 10 runs to a win. Such a conversion brings us from the final xRA to WAA, or Wins Above Average.

Then similarly to what FanGraphs did, we make an adjustment based on leverage since as-is starters have much higher WAAs than relievers. Baseball Reference uses the exact same leverage multiplier equation and chaining process as FanGraphs does. You can read about Leverage Index here. Here's that leverage multiplier equation again:
WAA_adj = WAA * (1.00 + leverage_index_pitcher)/2
This will give higher quality relievers with a higher average leverage index a larger WAA than worse relievers that pitch in less important situations. By rewarding the better relievers in such a way, they get to an adjusted WAA that is better than average starters. Again, there's no leverage adjustment for any starters.
The last thing Baseball Reference does is factor all players' adjusted WAAs so that the sum of the WAAs across the league is 0.

The final piece of pitcher WAR is defining replacement level. As mentioned previously, Baseball Reference allots 410 wins above replacement to pitchers. They define the replacement-level pitcher's runs allowed per out as RpO_replacement, which is the league average runs allowed per out * (20.5 - 1.8)/100. The 20.5 is called the Replacement Level Multiplier, and represents the # of runs a replacement level player would score per 600 plate appearances. The 1.8 is defined as "an empirical factor that makes the final result mostly closely align the sum of all player replacement runs to the desired league total". We aren't told anything besides that or given any work to support the 1.8 amount, let alone the 20.5 figure. We get the final runs above replacement as runs_above_avg + RpO_replacement * Outs pitched.

We then convert that runs above replacement to wins above replacement to get WAR_rep. Combining this with other pieces gets us the final pitcher WAR equation:
WAR = WAR_rep + WAA + WAA_adj

Both sites also calculate what a pitcher's positional WAR would be, and then combine that with their pitcher WAR to get the player's total WAR. Unlike FanGraphs, Baseball Reference does have some notes about the positional adjustment for pitchers. The WAR for position players page mentions these pieces for pitchers here. All of the normal position player pieces of WAR get calculated for pitchers, just the positional adjustment is handled differently. As our chart way up above of batting stats by position showed, pitchers are generally terrible at batting. Because of this, and along with the idea that most teams don't pick pitchers solely for their batting ability, Baseball Reference sets all pitcher batting such that their WAR is 0 for that part. For the Pitcher Positional Adjustment, the Batting Runs (Rbat), Baserunning Runs (Rbr), and GIDP Runs (Rdp) of every pitcher is added to get the total runs for pitchers in the league, Runs_sum_lg. Then we find the league total plate appearances by pitchers, PA_sum_lg, and divide by this amount to get the average pitcher runs per plate appearance. Baseball Reference assumes about 600 PAs per season for players, and given that pitchers normally produce negative runs on the positional WAR side, they multiply the pitcher runs per PA by -600 to make it a positive value. This positive value per PA is then multiplied by a pitcher's actual PAs to get their positional adjustment. So essentially for the adjustment, pitchers that simply bat more will get a larger adjustment; they can impact their adjustment to the extent that a single pitcher can alter the average performance of all pitchers (which is difficult to do).

Baseball Reference combines a pitcher's pitcher WAR with his positional WAR to get his total WAR. Madison Bumgarner had 1.2 position player WAR in 2014, and 3.7 pitcher WAR that season, for a total of 4.9 WAR. Shohei Ohtani had 4.9 position player WAR in 2021, and 4.1 pitcher WAR that season, for a total of 9.0 WAR, the most of any player last year. Since these 2 pieces of WAR are measured differently, combining them can be tricky, as can comparing them. If a position player had a WAR of 9 but a pitcher had a total WAR of 9, I'm not sure it would be accurate to say the two players are equivalent. The WARs are measured differently. However, if two pitchers had the same pitcher WAR of 4, but one had a positional WAR of 1.5 and the other had a positional WAR of 0, we can accurately conclude that we'd prefer the pitcher with the higher total WAR. Of course, given the universal DH these days, the importance of a pitcher's positional WAR has dwindled...

Let's summarize WAR for Pitchers

FanGraphs entirely uses FIP, which only cares about HRs, Ks, BBs, and HBPs, but for the purposes of WAR they also include infield fly balls. FIP does use the run values of these events, but scales them per IP rather than per occurrence.
They then adjust FIP using park factors, compare it to league average (AL or NL), convert from runs above average to wins above average (using a dynamic runs per win conversion), and then factor in replacement level to get Wins Above Replacement for pitchers. They also apply a leverage adjustment to increase the value of relief pitchers.
Baseball Reference uses xRA which considers the quality of batters the pitcher faces, and adjusts it for the quality of the pitcher's team's defense, as well as the park factors for the parks the pitcher actually pitches in.
They then apply a positional adjustment for starters and relievers and also use a leverage adjustment to increase the value of relief pitchers. From there they also convert the runs to wins, and factor in replacement level to get to WAR.
Both sites also compute the position player WAR for all pitchers, so that we can add a pitcher's 2 measures of WAR to get their total WAR and thus overall value.

The last step is knowing what makes a good WAR. A WAR of 0 means the player has no wins above replacement, so they are a replacement level player. A replacement level player shouldn't persist on a team and ought to be removed, and especially so if the player is in fact below replacement level (negative WAR). Any WAR less than 1 is really grouped into replacement level. A WAR of 1 to 2 means the player should be a bench role player. Most starters should have a WAR of 2 to 4, with decent starters being between 2 and 3, and better starters being between 3 and 4. Our high-quality starters that can make the All-Star game will have a WAR between 4 and 5. Our superstar players that will likely start in the All-Star game and win various accolades will have a WAR between 5 and 6. Lastly, any player with a WAR above 6 is MVP caliber and should receive some votes for that award.

Clearly my longest post by far, but that does it! I hope that I have been able to at least somewhat explain WAR in a helpful way and increase your understanding of its calculation, while also pointing out how complex it is, the lack of data and evidence provided to us, and the ways in which it can improve. I encourage everyone to dig into the flurry of links that I've included to get a better grasp of WAR and help answer any questions. I give credit to the people that have developed WAR. It is certainly a better overall measure of player value than many other baseline stats out there, and the idea of having a singular number to look at is appealing. I think that for the most part, the calculation of WAR is sound, but there are points of contention and there is a fundamental lack of transparency. If both sites would just include the data they use to calculate these pieces and spend more time explaining WAR and showing more of their work, then I think myself and many others would be more convinced. I think the thought process behind Baseball Reference's calculation of WAR is more appealing, but FanGraphs does a better job of explaining their process in a more readable format and providing us with equations, etc.

Here's a quick summary of my disagreements with WAR:

wOBA is pretty solid; my main beef is with using different weights each year. In my opinion, weights should only change to the extent that new data is added to the existing data. I don't think the value of a single changes from year to year, but rather that there is an intrinsic value of a single based on all the data we have over time. As more data comes, we can continuously refine and get closer to what that intrinsic value actually is. Why only look at one year and limit your sample size? Instead of saying a single is worth x in 2021 and y in 2022, say that a single is worth x from 1920-2021, and then update with the new data if needed to say that a single is worth y from 1920-2022.
I disagree for now with park factors and league adjustments. I think the leagues are more or less equal and that just because a certain park scores 5% more runs than average (or even has 5% more home runs than average), we shouldn't adjust a player's real HR mark by 5%. Such adjustments are fine for 'what if' analysis, but to seriously implement them is a big assumption and comes with some major implications.
I think the baserunning metrics are sound and like that they are based on run values, but holy moly show us some of the work and provide more of the data. Baseball Reference is slightly better for having the extra piece of being able to beat out ground balls and thus avoid ground ball double plays.
Nowhere near enough work is shown by Baseball Reference (or other sources) about how TZR and DRS are actually calculated, and switching up the metric used by era is ridiculous. How can we pretend to measure 2 players that played in vastly different eras using one metric, when how we calculate that metric over time is changing? *This is still an issue with FanGraphs, even with the improved OAA metric.
For FanGraphs, the difficulty adjustments of UBR are not in line with wOBA, and we need consistency. If we apply difficulty to fielding, we need to do so for hitting, and if we don't apply difficult to hitting, then we shouldn't for fielding.
We can fix positional adjustments more easily by just comparing to positional averages rather than league averages. Adjusting by a constant amount of x runs for a position just seems more messy. I'd argue that being above average at any position is equal. You're paid to play that position, and if you can do it far better than your peers, you should get the same amount of credit regardless of position. If you are going to apply some positional adjustment, you should factor in hitting as well; a good hitting second basemen is valuable and shouldn't be adulterated by being compared to all other hitters.
Replacement level isn't a fundamental mathematical concept based on data like average or a first quartile is. Using one of these would still get the job done and be more sound, in my opinion. I don't think using a value that you literally have to plug-and-chug into for a baseline is sound. We don't need some arbitrary (even if somewhat mathematically derived) baseline to say a player below this level needs to be replaced. We can instead say that below average players should try to be replaced, but especially players in the bottom 10% or 25% of their peers should definitely be replaced. I prefer to think about replacing players from an off-season perspective rather than from an in-the-season perspective.
The currency of baseball is runs, not wins. Adjusting runs to wins is unnecessary and not meaningful. Players can individually score runs but can't individually win games. A team wins a baseball game by having more runs than the other team. The value of a run more or less doesn't change over time, since it's all comparative. A solo walk-off HR will always score 1 run and will win the game regardless if the teams both had just 1 run or if the teams both had 22 runs.
Ignoring other parts of catcher fielding and only focusing on their blocking and throwing doesn't make sense to me.
Ignoring pitchers giving up many non-HR hits (such as a ton of doubles) and their ability to force players to hit grounders doesn't make sense to me.
WAR should mainly be used to compare players in a given season (who should win the MVP?), or at least in a similar era (who was the best player of the 2000s?). WAR should not be used compare players across different eras (should a player be in the Hall of Fame, based on his WAR? Who is the greatest player of all-time?).

I think those were the main ones, but I'm sure I complained about other parts throughout the post. As I've alluded to, I will plan to introduce my own measure of player value for my next post. It won't be simple per se, but much simpler relative to WAR and I will actually explain all of my steps and thoughts while showing all of my work. The batting piece will be very similar. I have similar weights as Tango got with wOBA, albeit via a different approach. The baserunning piece will be similar for base stealing, but I just don't have the data for advancing bases or beating out grounders to be able to include those. The fielding and pitching will be fundamentally different and much closer in thinking and implementation as the batting. We won't convert runs to wins, and we won't compare to replacement level. I look forward to sharing it with you all. Thanks for reading about WAR, and as always let me know if you have any questions in the comments!

Statting Lineup Newsletter Signup Form:
If you'd like to receive email updates for each new post that I make, sign up for the Statting Lineup newsletter using the link below:
https://weebly.us18.list-manage.com/subscribe?u=ab653f474b2ced9091eb248b1&id=3a60f3b85f

1 Comment

Statting Lineup
Blog Posts

WAR, wOBA, UZR, What Is It Good For? Absolutely... Something? (Defining Statistics 2)

Let's See the Readers' Stances on WAR:

Defining Statistics 1: Taking a Look at 12 of Baseball's Most Popular Metrics

Now It's Time For The Readers To Decide:

Statting Lineup Newsletter Signup Form:
If you'd like to receive email updates for each new post that I make, sign up for the Statting Lineup newsletter using the link below:
https://weebly.us18.list-manage.com/subscribe?u=ab653f474b2ced9091eb248b1&id=3a60f3b85f

Author:

Blog Categories

Archives

Statting Lineup​Blog Posts

WAR, wOBA, UZR, What Is It Good For? Absolutely... Something? (Defining Statistics 2)

Let's See the Readers' Stances on WAR:

Defining Statistics 1: Taking a Look at 12 of Baseball's Most Popular Metrics

Now It's Time For The Readers To Decide:

Statting Lineup Newsletter Signup Form:If you'd like to receive email updates for each new post that I make, sign up for the Statting Lineup newsletter using the link below:https://weebly.us18.list-manage.com/subscribe?u=ab653f474b2ced9091eb248b1&id=3a60f3b85f

Author:

Blog Categories

Archives

Statting Lineup
Blog Posts

Statting Lineup Newsletter Signup Form:
If you'd like to receive email updates for each new post that I make, sign up for the Statting Lineup newsletter using the link below:
https://weebly.us18.list-manage.com/subscribe?u=ab653f474b2ced9091eb248b1&id=3a60f3b85f