No other talent management system has been the subject of such great debate, change, and emotion as performance management (PM). Many strategies have been attempted to extract value from PM processes, ranging from simple rating scale changes to complex behavior change initiatives. Although many of these have seemed promising initially, they have yielded disappointing outcomes and significant dissatisfaction with PM processes in organizations. The challenges inherent in PM coupled with many unsuccessful attempts to fix it have given rise to what are viewed as trendy PM practices and a well-earned reputation as the Achilles’ heel of talent management practice (Pulakos et al. 2012). No matter what has been tried over decades to improve PM processes, they continue to generate inaccurate information and do virtually nothing to drive performance.
PM is challenging because it is a complex, multifaceted, and multilevel process that draws on theory and research from many different areas, including measurement theory and motivation theory; cognitive, clinical, social, and behavioral psychology; neuroscience; organizational development; and change management. It has been heavily influenced by practice as well, with business leaders and PM practitioners offering rating schemes and evaluation strategies they believe will drive higher performance. Research has generally focused on specific aspects of the PM process, such as the effects of different types of rating scales on ratings or the role of human information processing in evaluating others (DeNisi & Murphy 2017). Only a few studies have focused on the impact of different PM practices on performance and which features play the greatest role in driving business outcomes; thus, we have relatively few evidence-based insights about the impact and return on investment (ROI) of PM in organizations.
Over time, PM has become increasingly complex, requiring many hours of manager and employee time and costing organizations millions annually. CEB (2012) estimated that the average manager and employee spend 210 and 40 hours, respectively, on PM activities, which translated into costs of 30 million USD annually for a company of 10,000 people. As another example, Deloitte found that they were spending 2 million hours annually on PM activities (Buckingham & Goodall 2015). Not only have time investments and costs skyrocketed, but complaints have become increasingly vocal and emotional, especially concerning performance reviews. Culbert & Rout (2010) described the performance review as a “pretentious, bogus practice” that should be put out of its misery. A Washington Post headline read, “Study finds that basically every single person hates performance reviews” (https://www.washingtonpost.com/news/on-leadership/wp/2014/01/27/study-finds-that-basically-every-single-person-hates-performance-reviews/?utm_term=.03749c8f8b7d).
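To make the scale of the CEB (2012) figures concrete, the following minimal sketch works through the arithmetic. The hours per person and the 30 million USD total come from the text; the split of 1,000 managers and 9,000 employees is purely an illustrative assumption (the source does not report it), so the implied hourly cost should be read as a rough back-calculation rather than a reported figure.

```python
# Figures reported in the text (CEB 2012)
MANAGER_HOURS = 210          # hours per manager per year on PM activities
EMPLOYEE_HOURS = 40          # hours per employee per year on PM activities
ANNUAL_COST_USD = 30_000_000 # reported annual cost for a 10,000-person company

# ASSUMPTION (not from the source): a 1:9 manager-to-employee split
n_managers = 1_000
n_employees = 9_000

# Total hours the organization spends on PM activities per year
total_hours = MANAGER_HOURS * n_managers + EMPLOYEE_HOURS * n_employees

# Blended hourly cost implied by the reported total, under the assumed split
implied_hourly_cost = ANNUAL_COST_USD / total_hours

print(f"Total PM hours per year: {total_hours:,}")
print(f"Implied blended hourly cost: {implied_hourly_cost:.2f} USD")
```

Under this assumed split, the organization spends 570,000 hours per year on PM activities; a different manager-to-employee ratio would change both the total hours and the implied hourly cost.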
Most concerning, however, is that the few studies that have evaluated the impact of PM processes on performance and business outcomes have shown virtually no positive impacts. For example, based on 23,339 performance ratings from 40 organizations, CEB (2012) found that business units with highly rated employees were no more likely to be profitable than those with low-rated employees. So much cost and time, yielding so much dissatisfaction with no discernable performance impact, has led to a recent wave of sweeping, revolutionary reform, and experimentation with new-in-kind PM practices. These range from greatly simplifying or even eliminating formal PM processes to driving behavior changes that research has shown positively impact performance, such as providing regular informal feedback and setting agile, shorter-term goals (Mueller-Hanson & Pulakos 2018).
In this article, we briefly review the evolution of PM, which began with a much narrower focus on performance ratings. We trace its development from research evaluating the impact of rating format and training on ratings to that aimed at understanding how human information processing, rater-ratee interpersonal relationships, and political and contextual factors affect ratings. We then discuss how performance evaluation evolved into more comprehensive PM processes that included goal-setting, formal feedback, multirater reviews, etc., and we discuss how these practices have become misguided over time. Finally, we discuss current directions in PM, taking stock of what we have learned to date and suggesting directions for the future.
The early history of PM was focused on performance evaluation, the goal of which was to obtain accurate ratings of individual performance. The first large-scale use of ratings in work settings dates back to the late 1800s, with use of efficiency ratings in the US Federal Civil Service (Lopez 1968) and trait assessments (e.g., punctual, assertive) of officer performance during World War I (Scott et al. 1941). The first rating scale, the Graphic Rating Scale (Patterson 1922), used verbal and numerical anchors to improve the accuracy of trait ratings. Although this was a significant step, the anchors used were ill-defined (e.g., “Excellent,” “Good,” or “Poor”), leaving raters to impose their own interpretations on what these anchors meant (Landy & Farr 1980, Borman 1977). Raters applying their own idiosyncratic standards to defining different rating levels remains a persistent challenge today.
The emergence of scientific management theories in the early twentieth century (Taylor 1911) led to an increased focus on productivity and the corresponding use of ratings to control and drive higher performance (Grote 1996, Murphy & Cleveland 1995). The civil rights movement of the 1950s and 1960s brought attention to inequalities based on race and prompted more rigorous evaluation practices in organizations. The Civil Rights Act of 1964 and subsequent legislation prohibited discrimination in employment practices, prompting extensive work in the area of rating format design to ensure ratings were based on job-relevant factors and to mitigate bias (Dunnette 1963, Guion 1961). One idea that gained popularity was to anchor different rating levels with work behaviors to help managers match their observations of employee performance to an appropriate rating level (Smith & Kendall 1963, Blanz & Ghiselli 1972, Latham & Wexley 1977). Many variants of behavioral rating formats were designed and evaluated over the next 20 years, until Landy & Farr (1980) called for a moratorium on rating format research, concluding that no rating format yielded substantially more accurate or less biased ratings than any others (Murphy et al. 1982, Saal & Landy 1977).
Although rating format research largely ceased with Landy & Farr's (1980) moratorium, a new forced choice rating format was introduced in the early 2000s that has been shown to yield improved rating reliability, validity, and accuracy (Borman et al. 2001, Bartram 2007, Schneider et al. 2003). This format asks managers to choose which behavior is most true (or most and least true) of each employee's job performance from a set of equally desirable behaviors. Using item response theory (IRT) information for each item, raters’ judgments are converted to an interval scale; specifically, choosing one behavioral statement over the others provides information about the placement of each employee on the underlying dimension at the interval-scale level. Although research has shown this format to yield higher quality ratings, its adoption has been rare in practice. One reason is that advanced IRT concepts are difficult to explain. Forced choice formats also require large item banks with associated item parameters that can be prohibitive for organizations to develop and maintain. Finally, the main advantage of forced choice ratings is also likely its main disadvantage, namely that managers cannot easily manipulate their ratings to ensure employees receive certain reward outcomes; hence, forced-choice scales are not well received by managers.
A parallel path to improve ratings focused on rater training (Borman 1975, Latham et al. 1975). On the basis of the assumption that ratings are normally distributed, training programs were developed to teach raters to avoid common rating errors that would result in non-normally distributed ratings, such as leniency (most employees are rated at the high end of the scale). To reduce leniency, for example, raters were taught that most employees should be rated in the middle of the scale and equal but smaller proportions should be rated at each of the high and low ends. Subsequent research showed that error training did not increase accuracy and may actually reduce it (Murphy et al. 1993). Years later, O'Boyle & Aguinis (2012) provided evidence that performance is not normally distributed in many cases, which explains why training to produce normally distributed ratings would decrease accuracy.
A paradigm shift occurred in the early 1980s that influenced performance rating research for the next two decades. Landy & Farr (1980) argued that more holistic theories were needed to understand the interactive effects of different factors on ratings, and they proposed the use of human information-processing theories and models to guide future research. Ratings were conceptualized as a special case of human information processing that includes attention, categorization, recall, and information integration (Feldman 1981). Extensive research leveraged information-processing theories to understand rating behavior and develop interventions to improve rating accuracy. These focused on helping raters develop and use job-relevant mental categories in observing and evaluating employee performance (Ilgen & Feldman 1983). For example, rater training shifted from a focus on reducing rating errors (e.g., halo, leniency) (Cooper 1981, Murphy & Balzer 1989, Murphy et al. 1993) to helping raters create job-relevant mental categories that would direct their attention to relevant performance information and store it with related performance information to facilitate accurate recall (McIntyre et al. 1984; Pulakos 1984, 1986). Although information processing theories provided insights into the mental processes that impact how ratings are made, this research yielded few practical implications for evaluating performance more effectively in organizations.
Mounting concerns over discrimination and legal challenges in the 1970s and 1980s brought implementation of more structured evaluation processes. For example, management by objectives (MBO; Drucker 1954) provided a way to define, communicate, and evaluate employees against job-relevant performance objectives. Although MBO systems were widely adopted, they were eventually abandoned because they proved to be time-consuming and administratively burdensome for their value (Jamieson 1973, Strauss 1972). However, ideas stemming from MBO, such as setting objectives and measuring results, remain a common feature in PM processes today.
A popular rating method to emerge in the early 1980s was the forced distribution, introduced by former General Electric CEO Jack Welch. Under this approach, known as GE's “rank and yank” system, employees were slotted into categories based on how their performance stacked up to other employees', with small proportions (10–15%) identified as top and bottom performers and the remaining ∼80% slotted in the middle. The top and bottom groups often defined those to be promoted and separated, respectively. The practical problem forced distributions pose is that the top 10% in a low performing group may be performing with the same effectiveness as the bottom 10% in a high performing group, introducing both fairness and accuracy concerns if the groups are blindly combined. This issue is typically mitigated through calibration sessions in which employees are discussed and recategorized to ensure that the top and bottom 10% are accurately identified across all employees. However, this is a time-consuming process that becomes less informed as calibration rolls up through higher organizational levels and individual employee performance at lower levels becomes less well known. Although forced rankings remained popular for more than 30 years, their use is now on the decline, falling from 49% in 2009 to 14% in 2011, with GE among those to abandon this rating method (i4cp 2011).
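The cross-group problem with forced distributions can be illustrated with a small sketch. The function name, the group data, and the common 0-100 performance scale are all hypothetical, chosen only to show how within-group ranking can place two equally effective employees in opposite categories; this is not a description of any organization's actual system.

```python
def forced_distribution(scores, top_frac=0.10, bottom_frac=0.10):
    """Label employees top/middle/bottom by rank WITHIN their own group.

    scores: dict mapping employee id -> performance score (hypothetical data).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    n_top = max(1, round(n * top_frac))
    n_bottom = max(1, round(n * bottom_frac))
    labels = {}
    for position, employee in enumerate(ranked):
        if position < n_top:
            labels[employee] = "top"
        elif position >= n - n_bottom:
            labels[employee] = "bottom"
        else:
            labels[employee] = "middle"
    return labels

# Made-up scores for a low performing and a high performing group
low_scores = [40, 42, 45, 47, 50, 52, 55, 57, 60, 62]
high_scores = [62, 65, 68, 70, 73, 75, 78, 80, 85, 90]
low_group = {f"L{i}": s for i, s in enumerate(low_scores)}
high_group = {f"H{i}": s for i, s in enumerate(high_scores)}

low_labels = forced_distribution(low_group)
high_labels = forced_distribution(high_group)

# L9 and H0 have the same score (62) but land in opposite categories
print("L9:", low_group["L9"], low_labels["L9"])
print("H0:", high_group["H0"], high_labels["H0"])
```

In this toy example the best performer in the low group and the weakest performer in the high group have identical scores, yet one is slated for promotion and the other for separation, which is the discrepancy calibration sessions are meant to catch.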
Another rating strategy that emerged about this same time was gathering multisource or 360-degree ratings from peers, customers, or direct reports in addition to managers. The idea was that those with different role relationships to an employee observe different aspects of performance (Borman 1974). For example, customers will have unique insights into one's customer service effectiveness, whereas direct reports will be best equipped to evaluate a manager's feedback and mentoring performance. 360-degree ratings gained popularity in the 1980s and are still widely used today (Bracken et al. 2001, Smither et al. 2005). They are primarily used to provide developmental feedback, but they can also support decision making, if the rating information from the different sources is appropriately integrated and interpreted by the person's manager or a coach (Bracken et al. 2001). One caveat, however, is that decrements in the quality of multisource ratings are often observed when they are used for decision making versus development only (Greguras et al. 2003).
Performance Evaluation Challenges
Underlying performance rating research are three assumptions that are worthy of further exploration:
- Everyone has a stable level of true performance that reflects their effectiveness on the job.
- Raters are able to rate others accurately.
- Raters are motivated to rate others accurately.
Regarding the first, we assume that each individual has a “true” level of performance that they consistently exhibit on the job. We then use the extent to which different raters agree on their ratings of an individual as an indicator of how well the ratings are capturing a person's true performance level, with higher agreement giving us more confidence we are accurately measuring the person's true performance level. Rater agreement is hard to achieve, however, in part because raters bring their own standards to any rating situation, based on their past experience, personal rating tendencies, and idiosyncratic views about what constitutes good or poor performance (Landy & Farr 1980, Feldman 1981). However, rating disagreement can also stem from raters viewing different aspects of performance or, importantly, real differences in how employees actually behave in the presence of different raters. An individual may be highly responsive with managers but disregard peers, or may help only some peers but not others. These realities explain why interrater reliabilities are typically only in the .50 range (Viswesvaran et al. 1996), and they also raise questions about the extent to which true performance can be agreed upon among raters, or even exists.
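A simple way to see what an interrater reliability near .50 implies is a classical true-score sketch, in which each rating is a stable true score plus rater-specific noise. This model is an assumption introduced here for illustration (it takes the first assumption above at face value, which the text itself questions), and the function names and variance values are hypothetical. Under this model the expected correlation between two raters is var_true / (var_true + var_rater), so a value of .50 corresponds to rater-specific variance as large as the true-score variance.

```python
import random

def expected_agreement(var_true, var_rater):
    """Expected correlation between two raters under the true-score model."""
    return var_true / (var_true + var_rater)

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def simulate_two_raters(n_employees, sd_true=1.0, sd_rater=1.0, seed=0):
    """Simulate two raters who each see true score plus independent noise."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, sd_true) for _ in range(n_employees)]
    rater_a = [t + rng.gauss(0, sd_rater) for t in true_scores]
    rater_b = [t + rng.gauss(0, sd_rater) for t in true_scores]
    return pearson(rater_a, rater_b)

# Equal true-score and rater-specific variance (illustrative values)
analytic = expected_agreement(var_true=1.0, var_rater=1.0)
simulated = simulate_two_raters(n_employees=20000)

print(f"Analytic agreement:  {analytic:.2f}")
print(f"Simulated agreement: {simulated:.2f}")
```

The sketch cannot distinguish rater idiosyncrasy from genuine differences in how an employee behaves with different raters; both would appear as rater-specific variance, which is why low agreement alone does not establish that ratings are inaccurate.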
The second assumption is that raters can make accurate ratings with proper rating instruments and training. The reality is that most managers can identify who is doing a job capably, who is failing, and who is performing above and beyond. However, the ratings managers are asked to make are sometimes so nuanced and detailed that they are beyond their information-processing capabilities. Managers see thousands of performance examples in a year-long rating cycle—far too many to recall, weight, and summarize with a high degree of accuracy for each employee. Furthermore, managers do not see performance in some areas (e.g., how employees engage with their direct reports) and may not have the subject matter expertise to judge some of what they do see (e.g., general managers rating highly technical performance), causing them to rely on biased impressions, what others say, or stand-out examples of obviously exceptional or poor performance (Landy & Farr 1980). Rating scales that contain many rating levels or factors that require highly nuanced judgments are asking for rating precision that managers cannot realistically provide (Pulakos & O'Leary 2010). Many overengineered rating formats and processes have been developed that do not align well with raters’ information-processing capabilities.
The third assumption is that raters are motivated to evaluate others accurately. However, several studies question this assumption by showing that various contextual factors undermine rating accuracy (Tziner & Murphy 1999, Murphy et al. 2004). Murphy & Cleveland (1995) suggested four competing goals that managers must negotiate and balance when they evaluate employees:
- Task performance goals, which entail using ratings to influence subsequent performance.
- Interpersonal goals, which entail using ratings to maintain or improve relationships with employees.
- Strategic goals, which entail using ratings to increase the manager's or workgroup's standing in the organization.
- Internalized goals, which reflect raters’ personal beliefs about how they should evaluate performance.
It has been proposed that political, social, and practical factors carry so much weight in managers’ rating behavior that rating accuracy and employee differentiation are simply not relevant drivers of ratings (Adler et al. 2016). It has similarly been argued that managers have few if any incentives to rate employees accurately (Pulakos & O'Leary 2011). Unless an employee is a problem performer, many managers take the pragmatic approach of playing to employees’ strengths and assigning them work they can do well (Mueller-Hanson & Pulakos 2018). They realize that all employees are imperfect and each brings different capabilities to a job. If employees—especially experienced employees—are making solid contributions, managers often overlook weaker areas for which there is little chance of change or growth. Experienced managers understand the practical realities and costs of replacing staff and onboarding new staff that will also be imperfect. Finally, they understand that employees want to be recognized and praised, which creates a strong incentive for them to rate their key, albeit imperfect, employees above the midpoint of any scale in order to keep them motivated and engaged.
Managers also take a pragmatic approach to how they use ratings in pay and reward decisions (Pulakos et al. 2015). Instead of using the rating process to arrive at an evaluation and then translate this into a pay decision, managers are more likely to retrofit their ratings to align with the reward decisions they want to make at a given point in time. The practical considerations that drive pay decisions include mitigating attrition risk, managing internal or external equity, and even whose turn it is to get a larger increase—a phenomenon that results from the relatively small (2–3%) raise pools that most organizations have today. To the extent that ratings align with pay increases, it is often the latter driving the former.
Given that a manager's job is to get the highest performance out of his or her team, a key part of the job is using all available levers to keep the collective group engaged, productive, and performing. Managers also have their own motivations and advancement goals that the perception of a high performing and engaged team helps them achieve. Although rating everyone at the high end of a rating scale may not yield accurate evaluations, it can be argued that this is rational behavior, especially when today's managers are being asked to do more with less, they want their teams and themselves to look good, and they want access to rewards and future opportunities (Mueller-Hanson & Pulakos 2018). Context factors thus have profound impacts on ratings and their implications need to be better understood and accounted for in the design of PM processes.
Summary and Next Steps for Performance Evaluation Research and Practice
What we have learned over decades of research and practice is that performance ratings bring significant challenges. Employees behave differently with different raters due to role and relationship differences. Managers bring their own standards, levels of sophistication, and expertise to evaluating others. They are swayed by their own biases, differences in the quality of their relationships with different employees, and their rating preferences. They also see only a slice of each person's work behavior and may or may not have the expertise to accurately evaluate what they see. These factors make true performance impossible to define and rating accuracy impossible to evaluate. The cognitive processes humans naturally use to process performance information leave raters with summarized impressions rather than detailed performance information. Although heavy requirements for precise and nuanced ratings may inspire more confidence that we are closer to ground truth about an employee's performance, this is a false sense of confidence, because there is no evidence that more complex ratings improve accuracy or fairness. An important consideration is how much rating differentiation and accuracy is actually needed, especially when ratings have been shown to have no impact on performance. Unless highly nuanced differentiation is required to distribute significant rewards, complex rating processes are unlikely to yield ROI commensurate with their costs. Simpler judgments that align with the overall judgments raters naturally make are likely sufficient and most practical for the majority of evaluation needs (see sidebar Summary: Performance Evaluation).
SUMMARY: PERFORMANCE EVALUATION
Below is a summary of our current conclusions regarding performance evaluation based on research and practice to date:
- Ratings are inherently limited in their value as performance measures.
- Rater-ratee relationship differences yield actual performance differences, which raises questions about whether a “true” performance level exists that can be reliably captured across raters.
- Raters can accurately place others into general categories but cannot make nuanced performance judgments accurately.
- Political and social factors have very strong impacts on ratings.
- Properly selected, performance measures beyond ratings may mitigate challenges with ratings.
Given the challenges inherent in ratings, performance evaluation can benefit from leveraging measures other than ratings. In some jobs, a great deal of performance information is readily available beyond ratings, such as customer surveys, sales data, production data, efficiency indices, and billable hours. Although these measures also have limitations (e.g., they can be deficient or contaminated), they can provide a more well-rounded picture of performance that goes beyond ratings. When collected on a regular basis, such measures can be used to signal performance issues early and drive real-time feedback to course-correct. An added benefit is that nonrating measures lessen the pressure on managers, as their role can shift from judging employees to helping them understand and respond to different performance measures. The use of multisource ratings combined with attention to performance measures that exist in the environment may reduce the impact of political and social factors on ratings. With the level of digital transformation that is occurring in organizations coupled with the increasing focus on analytics, we will see increasing availability of and focus on performance measures that are automatically and frequently generated. The questions for future performance evaluation research are (a) how to leverage and combine available performance information (e.g., various metrics and ratings) into meaningful, sensible, valid, and fair performance assessments and (b) what role humans will play in future performance evaluation processes. The answers to these questions will almost certainly result in new performance measurement strategies and practices that may mitigate the limitations of ratings but will also need to be carefully evaluated for their potential consequences.
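As one naive illustration of what combining performance information could look like, the sketch below standardizes each measure across employees and takes a weighted average. This is only one possible approach, not a recommended or validated method; the measure names, data, and equal weights are hypothetical, and the sketch sidesteps the harder questions of deficiency, contamination, validity, and fairness raised above.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one measure across employees (0.0 if there is no spread)."""
    m = mean(values.values())
    sd = pstdev(values.values())
    return {k: 0.0 if sd == 0 else (v - m) / sd for k, v in values.items()}

def composite(measures, weights):
    """Weighted average of standardized measures for each employee.

    measures: dict of measure name -> {employee: raw value}
    weights:  dict of measure name -> relative weight (hypothetical choices)
    """
    standardized = {name: z_scores(vals) for name, vals in measures.items()}
    total_weight = sum(weights.values())
    employees = next(iter(measures.values())).keys()
    return {
        e: sum(weights[n] * standardized[n][e] for n in measures) / total_weight
        for e in employees
    }

# Made-up data for three employees on three measures with different scales
measures = {
    "sales": {"A": 100, "B": 120, "C": 140},
    "customer_survey": {"A": 4.5, "B": 4.0, "C": 3.5},
    "manager_rating": {"A": 3, "B": 4, "C": 3},
}
weights = {"sales": 1.0, "customer_survey": 1.0, "manager_rating": 1.0}

scores = composite(measures, weights)
for employee, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(employee, round(score, 2))
```

Standardizing first keeps a measure with large raw units (such as sales) from dominating one on a 1-5 scale; the choice of weights, however, is a substantive judgment that this sketch leaves entirely open.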
With flatter, leaner organizations and pressure to do more with less, performance evaluation eventually evolved into more comprehensive PM processes that included a fuller array of activities to drive performance, such as cascading goals, expectation setting, and interim feedback reviews (Smither & London 2009, London & Mone 2014). These processes became fairly standard over the past 15–20 years, especially as organizations began acquiring automated PM systems to improve efficiency (Aguinis 2013, London & Mone 2014). In these systems, employees are usually evaluated on behavior and results (Pulakos 2009). The idea is that both “how” employees perform (behaviors) and “what” they deliver (results) are considered important aspects of their performance. Behavioral ratings are more useful for course-correcting than results, which come after the fact. Behavioral ratings are notoriously attenuated, however, with most people rated above the midpoint of the rating scale, reducing their usefulness (Pulakos 2009). Results capture what some argue is most important, namely the outcomes one achieves, although results measures can break down when goal attainment is outside an employee's control or results from team performance rather than individual performance, which is often the case today (Locke & Latham 1990, Ployhart et al. 2009). A typical automated PM process is shown in Figure 1.
Figure 1. Typical performance management process used in organizations.