By: Kate Bauer
John Henry, well, he told his captain /
“Captain, a man, he ain’t nothing but a man /
Before I let your steam drill /
beat me down /
I’m gonna die with a hammer /
in my hand, Lord, Lord /
I’ll die with a hammer in my hand”
John Henry cut through a mountain, exerting extraordinary effort to beat the steam drill. The price was steep, costing Henry his life. Today new technologies providing reliable, cost-effective alternatives to routine human tasks seem to crop up daily. Innovators wield data to justify adopting the new technology; once adopted, technology’s capabilities eventually become the standard against which human efforts are measured. Even superb workers struggle to compete with machines that do not eat, do not sleep, and do not tire.
John Henry’s story is currently playing out in the eDiscovery field. When identifying responsive documents for production, responding parties have a duty under Rule 26(g) to make a reasonable inquiry. Requirements for a reasonable inquiry have been evolving in recent years. In the past, having humans manually review the documents was assumed to be accurate, with courts deferring to an attorney’s professional judgment that he had made a reasonable inquiry unless the requesting party could show a deficiency. The increasing use of technology-assisted review (TAR)—computer algorithms that analyze relevance decisions humans make on a small set of documents, and then extrapolate those decisions to the document collection—has led to an increased emphasis on statistically validating the quality of the production set up front. These validation requirements were initially intended to objectively verify the accuracy of TAR algorithms, ensuring their results constituted a reasonable inquiry. Now, at least one court has proposed requiring statistical validation of manual review as well, with threshold recommendations for what values constitute a “high-quality” review. While research indicates that TAR can probably meet these requirements with ease, the accuracy demands will likely tax the capabilities of human reviewers.
Humans, meet the steam drill. Good luck keeping up.
Reasonable Inquiry: Historically, a Deferential Standard
The Rules require an attorney to conduct a reasonable inquiry when putting together a production set. A reasonable production must be substantially complete, as “[a]n evasive or incomplete disclosure, answer, or response must be treated as a failure to disclose, answer, or respond.” To determine whether an inquiry is reasonable, the court looks to the totality of the circumstances. The most common remedy for an inadequate production is a motion to compel, with sanctions generally reserved for when an attorney has failed to discover the obvious, when a party fails to timely disclose relevant information, or when a party obstructs access to relevant information. Absent a showing of deficiency by the requesting party, courts treated a responding party’s assertion of reasonableness deferentially.
To meet the “reasonable inquiry” requirement, it was standard to hire armies of review attorneys to examine potentially relevant documents for privilege and production. Although utilizing massive review teams was generally considered reasonable if the team was adequately staffed and trained, the exhaustive review approach became increasingly unsustainable from a cost perspective as document volumes continued to grow.
TAR: Reasonable Inquiry Requires Verification
At the turn of the 21st century, document discovery was an area ripe for innovation. The document review process was painfully expensive: a 1998 study of federal cases found that discovery amounted to at least half of all litigation expenses, and discovery represented 90% of total litigation costs in the most expensive cases. By 2014, most Fortune 1000 corporations were spending between $5 million and $10 million annually on eDiscovery costs, with seventy percent of those costs spent on document review. Several companies reported spending as much as $30 million on eDiscovery.
Innovators developed technology-assisted review to address the increasing data volume. These computer algorithms promised to drastically reduce the time and cost of document discovery by “learning” from attorney decisions regarding relevance for a subset of documents, then applying the same designations to similar documents in the review set. Attorneys initially regarded these machine-learning claims with skepticism, so TAR advocates plied statistics to make the case that TAR was at least as accurate as exhaustive human review, and possibly more so. On top of its documented accuracy, on average TAR saved parties 45% of the costs of a traditional document review. Like the steam drill operators of old, small teams of specialists using TAR could replace the armies of human reviewers laying eyes on every document.
Today TAR advocates have largely succeeded: numerous courts have embraced TAR usage, albeit usually with caveats. In contrast to the historical deference to human judgment, when TAR was first adopted courts had concerns about its accuracy. Though academic studies had demonstrated impressive results for certain TAR solutions, the wide variety of solutions purporting to be TAR created fears about the “black box” of the technology. That one TAR solution worked did not mean a different one would. Courts were therefore interested in objective statistical measurements such as precision and recall that would help them in determining whether the TAR process had worked (i.e. whether it was reasonable). Statistical validation allows parties to determine whether TAR is making good predictions or bad ones without verifying TAR’s predictions for every document in the document set.
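As a rough illustration of the metrics mentioned above, the sketch below estimates recall and precision from a small validation sample rather than from the full document set. Everything in it is hypothetical: the class name, the field names, and the ten-document sample are assumptions made for illustration and are not drawn from any court order or TAR product. Recall follows the Grossman-Cormack Glossary definition quoted in the notes (the fraction of relevant documents that the review identified as relevant); precision uses the standard information-retrieval definition (the fraction of documents the review marked relevant that actually are relevant).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampledDocument:
    """One document drawn into a hypothetical validation sample."""
    doc_id: str
    expert_says_relevant: bool   # coding by the reviewer treated as ground truth
    review_says_relevant: bool   # decision made by the review under test (TAR or manual)


def recall_and_precision(sample):
    """Estimate recall and precision from a validation sample.

    Returns (recall, precision); either value is None when its
    denominator is zero and the metric is therefore undefined.
    """
    true_pos = sum(d.expert_says_relevant and d.review_says_relevant for d in sample)
    false_neg = sum(d.expert_says_relevant and not d.review_says_relevant for d in sample)
    false_pos = sum(not d.expert_says_relevant and d.review_says_relevant for d in sample)

    relevant_total = true_pos + false_neg    # documents that truly are relevant
    retrieved_total = true_pos + false_pos   # documents the review marked relevant

    recall = true_pos / relevant_total if relevant_total else None
    precision = true_pos / retrieved_total if retrieved_total else None
    return recall, precision


# Hypothetical sample: 3 relevant documents found, 1 relevant document missed,
# 1 irrelevant document wrongly marked relevant, 5 irrelevant documents correctly excluded.
sample = (
    [SampledDocument(f"tp-{i}", True, True) for i in range(3)]
    + [SampledDocument("fn-0", True, False)]
    + [SampledDocument("fp-0", False, True)]
    + [SampledDocument(f"tn-{i}", False, False) for i in range(5)]
)

recall, precision = recall_and_precision(sample)
print(f"estimated recall: {recall:.0%}, estimated precision: {precision:.0%}")
```

In practice a sample this small would give a very imprecise estimate; an actual validation protocol would specify how the sample is drawn and how large it must be.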
To ensure TAR was reasonable, courts encouraged parties to be more transparent with TAR productions. Rather than accepting an attorney’s representation that he had conducted a reasonable inquiry, courts advocated (and at times required) that parties provide unprecedented access to information detailing exactly how accurate their review had been. This approach, though understandable, broke with the deferential approach courts had historically taken to manual review.
Reevaluating Reasonable Inquiry: From Deference to Verification for Manual Review
The same statistics that supported TAR’s superiority also exposed an uncomfortable truth: exhaustive human review is not the “gold standard” it had previously been assumed to be. In fact, research shows that manual human review is fraught with inconsistency: information retrieval professionals disagree about whether a document is relevant at least as often as they agree. Attorneys fare even worse: in one study, attorneys agreed on relevance, at best, a mere 28.1% of the time. Alternative methods to reduce document volume such as using search terms, or “keywords,” also have significant shortcomings.
Nevertheless, before TAR’s emphasis on statistical validation, little effort was expended to measure the accuracy of attorney coding decisions. Absent a showing to the contrary, courts accepted an attorney’s professional judgment that a review was complete, which was embodied in the Rule 26(g) certification. A requesting party who disagreed had the burden of showing that the responding party’s production was inadequate before a motion to compel would be granted.
The widespread usage of statistics to validate machine predictions has caused some to suggest applying a similar approach to manual review. As TAR became more mainstream, so too did awareness about the statistical metrics necessary to evaluate its performance. The Rule 26(g) reasonableness inquiry is an objective one: what could be more objective than statistical validation?
Special Master Maura Grossman recently ordered parties to apply statistical validation to manual review in In Re Broiler Chicken Antitrust Litigation, a class action currently pending in the Northern District of Illinois. Anticipating document-intensive discovery, Magistrate Judge Jeffrey Gilbert brought in Grossman, a well-known TAR expert, to arbitrate eDiscovery disputes. On January 3, 2018, Grossman issued a comprehensive Order governing processing, search methods, and validation. The Order observes that “[t]he review process should incorporate quality-control and quality-assurance procedures to ensure a reasonable production consistent with the requirements of Federal Rule of Civil Procedure 26(g).” Uniquely, the Order requires parties to use a Subject Matter Expert (SME) to calculate recall regardless of whether the parties use manual review or TAR. While recognizing that recall alone is not dispositive, the Order states that “a recall estimate on the order of 70% to 80% is consistent with, but not the sole indicator of, an adequate (i.e., high-quality) review.”
The Broiler Chicken validation provisions weigh heavily in favor of choosing TAR over manual review. Research shows that manual review tends to produce a wide range of recall values. For example, in their influential 2011 JOLT study Maura Grossman and Gordon Cormack compared the results of a TAR methodology against a manual review for five different review topics. While the TAR algorithms averaged a recall of 78.7% across all five categories (ranging from 67.3% to 86.5%), the human reviewers finished with an average recall of 59.3% (ranging from 25.2% to 79.9%). Similarly, in another study, researchers found that the recall of manual review ranged from 52.8% to 83.6%. Contrasting the wide range of manual review recall values with the tight band of TAR recall values supports the conclusion that manual review runs a greater risk of failing the Broiler Chicken recall validation protocol.
In addition to the increased risk of failing the Broiler Chicken validation protocol, manual human review involves a costly level of effort. While Rule 1 asserts that the Rules “should be construed, administered, and employed . . . to secure the just, speedy, and inexpensive determination of every action and proceeding,” purely human efforts generally involve greater time and expense than TAR. As shown by the JOLT study, the increased time and expense of human review is unlikely to translate into greater accuracy than TAR provides.
Electing human review over TAR will usually run afoul of Rule 1’s objectives of speeding the determination of actions while reducing costs and preserving justice. Most obviously, manual review takes longer and incurs higher costs than TAR. Manual review is usually billed at a per-reviewer hourly rate, or on a per-document basis across the entire corpus of potentially relevant documents. TAR, by contrast, only requires human review of a small subset of documents, with the technology eliminating the need to manually review large portions of the document set. Additionally, when more reviewers are used, there is a greater chance that their interpretations of relevance will vary, increasing the risk of low recall. In contrast, a small number of SMEs using TAR to extrapolate their judgments will likely have a more consistent view of relevance than dozens of reviewers. Lastly, if the recall of the manual review is not at an acceptable level, the existing coding must be manually reevaluated on a document-by-document basis, a laborious and costly process. However, correcting a TAR algorithm’s incorrect predictions is simple: just change the coding designation on the document the algorithm used to make its prediction. Within hours (if not minutes) the algorithm will adjust its predictions for all related documents in the document set. Not even the John Henrys of the document review world can compete with such speed.
Broiler Chicken: A Pecking Order for the Future?
Does Broiler Chicken portend a shift in how courts will assess the Rule 26(g) reasonable inquiry going forward? Though the Order permits parties to choose either manual review or TAR, there is little doubt that the validation protocol favors TAR usage. Academic studies and in-the-field usage have established that using TAR is less expensive and faster than manual review, while being at least as accurate (if not more so). A party who chooses manual review risks (1) falling short of the accuracy mark after incurring substantial time and expense, and (2) having to reevaluate its coding decisions one-by-one.
TAR has not yet entirely surpassed human capabilities, but that time may be coming. In the Grossman and Cormack JOLT study, on one topic the humans did outperform the algorithm. Though less than 1% separated man from machine, this slim win for human accuracy shows that traditional methods are not yet obsolete. For now, courts generally accept that “[r]esponding parties are best situated to evaluate and select the procedures, methodologies, and technologies for their e-discovery process.” Still, the time may be coming when “it might be unreasonable for a party to decline to use TAR.” After all, John Henrys are rare, but the technology to rival them is cheap, fast, and readily available.
 Bruce Springsteen with the Seeger Sessions Band, BBC Four Sessions (2008) (covering Pete Seeger’s song “John Henry”), https://www.youtube.com/watch?v=U3eutnpTr3E.
 For an alternative theory, see William Grimes, Taking Swings at a Myth, With John Henry the Man, N.Y. Times (Oct. 18, 2006) (“A smoothly coordinated human team had an advantage over the early drills, which constantly broke down. The machines were highly efficient, however, at generating clouds of silicon dust. Contrary to the picture presented by the ballads, John Henry would have died not of exhaustion or a burst heart, but of silicosis, a fatal, fast-moving lung disease that took the lives of hundreds of railroad workers.”).
 Fed. R. Civ. P. 26(g).
 See Herbert L. Roitblat, The Pendulum Swings: Practical Measurement in eDiscovery, OrcaBlog (Nov. 4, 2014, 3:57 PM), https://web.archive.org/web/20150405030235/http://orcatec.com/2014/11/04/the-pendulum-swings-practical-measurement-in-ediscovery/ (“We have gone from assessing eDiscovery on the basis of an attorney’s opinion: ‘I’m a professional and I conducted a professional enquiry,’ to . . . a perceived need to ‘prov[e] that we have obtained a certain level of recall.’”); see also Hyles v. New York City, No. 10CIV3119ATAJP, 2016 WL 4077114 at *3 (S.D.N.Y. Aug. 1, 2016); Ford Motor Co. v. Edgewood Properties, Inc., 257 F.R.D. 418, 427–28 (D.N.J. 2009); The Sedona Conference, The Sedona Principles, Third Edition: Best Practices, Recommendations & Principles for Addressing Electronic Document Production, 19 Sedona Conf. J. 1, 52 (2018) (“The requesting party has the burden on a motion to compel to show that the responding party’s steps to preserve and produce relevant electronically stored information were inadequate.”).
 Maura R. Grossman & Gordon V. Cormack, The Grossman-Cormack Glossary of Technology-Assisted Review, 7 Fed. Cts. L. Rev. 1, 32 (2013) (Technology-assisted review is a “process for Prioritizing or Coding a Collection of Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.”).
 See Roitblat, supra note 4.
 See Da Silva Moore v. Publicis Groupe & MSL Grp., 287 F.R.D. 182, 192 (S.D.N.Y. 2012) (“[I]t is unlikely that courts will be able to determine or approve a party’s proposal as to when review and production can stop until the computer-assisted review software has been trained and the results are quality control verified. Only at that point can the parties and the Court see where there is a clear drop off from highly relevant to marginally relevant to not likely to be relevant documents.”).
 In re Broiler Chicken Antitrust Litig., No. 1:16-cv-08637, 2018 U.S. Dist. LEXIS 33140 at *50–51 (N.D. Ill. Jan. 3, 2018).
 Fed. R. Civ. P. 26(g).
 See Fed. R. Civ. P. 37(a)(4).
 Fed. R. Civ. P. 26 advisory committee notes to the 1983 amendments.
 R & R Sails Inc. v. Ins. Co. of Penn., 251 F.R.D. 520, 525 (S.D. Cal. 2008).
 Gucci America, Inc. v. Costco Wholesale, No. 02 Civ. 3190 (DAB) (RLE), 2003 WL 21018832 at *2 (S.D.N.Y. May 6, 2003).
 St. Paul Reinsurance Co. v. Commercial Financial Corp., 198 F.R.D. 508, 511 (N.D. Iowa 2000).
 See The Sedona Conference, supra note 4, at 52.
 See, e.g., Datel Holdings Ltd. v. Microsoft Corp., No. C-09-05535 EDL, 2011 WL 866993 at *4 (N.D. Cal. Mar. 11, 2011) (finding that party took reasonable steps to prevent disclosure of privileged documents including providing review attorneys with written instructions and a tutorial for the review); Kandel v. Brother Int’l Corp., 683 F. Supp. 2d 1076, 1085–86 (C.D. Cal. 2010) (finding that party had taken reasonable steps to prevent disclosure of privileged documents by staffing and training a document-review team).
 See Datel, 2011 WL 866993 at *4; Kandel, 683 F. Supp. 2d at 1085–86.
 Scott A. Moss, Litigation Discovery Cannot Be Optimal but Could Be Better: The Economics of Improving Discovery Timing in A Digital Age, 58 Duke L.J. 889, 892 (2009).
 Jennifer Booton, Don’t Send Another Email Until You Read This, MarketWatch (Mar. 9, 2015, 10:10 AM), http://www.marketwatch.com/story/your-work-emails-are-now-worth-millions-of-dollarsto-lawyers-2015-03-06.
 See Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII Rich. J.L. & Tech. 1, 37 (2011) (reporting that manual reviewers identified between 25% and 80% of relevant documents, while technology-assisted review returned between 67% and 86%); see also Herbert L. Roitblat et al., Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, J. of Am. Soc’y for Info. Sci. & Tech. 70, 79 (2010) (performing an empirical assessment to “answer the question of whether there was a benefit to engaging in a traditional human review or whether computer systems could be relied on to produce comparable results,” and concluding that “[o]n every measure, the performance of the two computer systems was at least as accurate (measured against the original review) as that of human re-review.”).
 eDiscovery Institute Survey on Predictive Coding, eDiscovery Institute 3 (Oct. 21, 2010), https://www.discovia.com/wp-content/uploads/2012/07/2010_EDI_PredictiveCodingSurvey.pdf.
 Anne Kershaw & Joseph Howie, Crash or Soar? Will the Legal Community Accept “Predictive Coding?”, L. Tech News, Oct. 2010, http://www.knowledgestrategysolutions.com/wp-content/uploads/LTN_CrashOrSoar_2010_Oct.pdf.
 See, e.g., Green v. Am. Modern Home Ins. Co., No. 14–CV–04074, 2014 WL 6668422 at *1 (W.D. Ark. Nov. 24, 2014); Dynamo Holdings Ltd. P’Ship v. Comm’r of Internal Revenue, 143 T.C. 183, 185 (T.C. Sept. 17, 2014); Aurora Coop. Elevator Co. v. Aventine Renewable Energy–Aurora W. LLC, No. 12 Civ. 0230, Dkt. No. 147 (D. Neb. Mar. 10, 2014); Edwards v. Nat’l Milk Producers Fed’n, No. 11 Civ. 4766, Dkt. No. 154: Joint Stip. & Order (N.D. Cal. Apr. 16, 2013); Bridgestone Am., Inc. v. IBM Corp., No. 13–1196, 2014 WL 4923014 (M.D. Tenn. July 22, 2014); Fed. Hous. Fin. Agency v. HSBC N.A. Holdings, Inc., 11 Civ. 6189, 2014 WL 584300 at *3 (S.D.N.Y. Feb. 14, 2014); EORHB, Inc. v. HOA Holdings LLC, No. Civ. A. 7409, 2013 WL 1960621 (Del. Ch. May 6, 2013); In re Actos (Pioglitazone) Prods. Liab. Litig., No. 6:11–MD–2299, 2012 WL 7861249 (W.D. La. July 27, 2012) (Stip. & Case Mgmt. Order); Global Aerospace Inc. v. Landow Aviation LP, No. CL 61040, 2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012); Da Silva Moore v. Publicis Groupe & MSL Grp., 287 F.R.D. 182 (S.D.N.Y. 2012).
 Andrew Peck, Search, Forward, L. Tech. News, Oct. 2011, 25, 29 (“[I]f the use of predictive coding is challenged in a case before me, I will want to know what was done and why that produced defensible results. I may be less interested in the science behind the ‘black box’ of the vendor’s software than in whether it produced responsive documents with reasonably high recall and high precision.”).
 To calculate recall, a party must review a mix of relevant and irrelevant documents. The party then codes the documents for relevance and compares its decisions about relevance against the algorithm’s predictions about relevance. The more alignment between human reviewer and the algorithm’s predictions, the better the recall. See Grossman & Cormack, supra note 5 at 27 (defining “recall” as “[t]he fraction of Relevant Documents that are identified as Relevant by a search or review effort.”).
 Da Silva Moore, 287 F.R.D. at 185 (“The Court explained that ‘where [the] line will be drawn [as to review and production] is going to depend on what the statistics show for the results,’ since ‘[p]roportionality requires consideration of results as well as costs.’”).
 John Tredennick, Ask Catalyst: In TAR, What Is Validation And Why Is It Important?, Catalyst (Sept. 27, 2016), https://catalystsecure.com/blog/2016/09/ask-catalyst-in-tar-what-is-validation-and-why-is-it-important/.
 See id.; see also Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125, 128–29 (S.D.N.Y. 2015); Bridgestone, 2014 WL 4923014 at *1 (“[O]penness and transparency in what Plaintiff is doing will be of critical importance. Plaintiff has advised that they will provide the seed documents they are initially using to set up predictive coding. The Magistrate Judge expects full openness in this matter.”); Progressive Cas. Ins. Co., 2014 WL 3563467 at *11 (declining to allow predictive coding when counsel was “unwilling to engage in the type of cooperation and transparency that . . . is needed for a predictive coding protocol to be accepted by the court . . . .”); Transcript of Record at 9, 14, Fed. Hous. Fin. Agency v. JPMorgan Chase & Co., No. 1:11-cv-06188 (S.D.N.Y. July 24, 2012) (bench decision requiring transparency and cooperation, including giving the plaintiff full access to the seed set’s responsive and non-responsive documents except privileged).
 See The Sedona Conference, The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery, 8 Sedona Conf. J. 189, 199 (2007) (“[T]here appears to be a myth that manual review by humans of large amounts of information is as accurate and complete as possible– perhaps even perfect–and constitutes the gold standard by which all searches should be measured.”).
 Ellen M. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt. 697, 701 (2000) (concluding that assessors disagree that a document is relevant at least as often as they agree).
 Roitblat et al., supra note 22, at 74.
 David C. Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System, 28 Commc’ns Ass’n Computing Mach. 289, 295–96 (1985) (finding that paralegals who thought they had retrieved 75% of relevant documents using iterative keyword searches had only found 20%).
 Roitblat, supra note 4.
 See sources cited supra note 4.
 See John Tredennick, Measuring Recall in E-Discovery Review, Part Two: No Easy Answers, Catalyst (Dec. 5, 2014), https://catalystsecure.com/blog/2014/12/measuring-recall-in-e-discovery-review-part-two-no-easy-answers/; Ralph Losey, Visualizing Data in a Predictive Coding Project, e-Discovery Team (Nov. 9, 2014, 8:10 PM), https://e-discoveryteam.com/2014/11/09/visualizing-data-in-a-predictive-coding-project/; Herbert L. Roitblat, The Pendulum Swings: Practical Measurement in eDiscovery, OrcaBlog (Nov. 4, 2014, 3:57 PM), https://web.archive.org/web/20150405030235/http://orcatec.com/2014/11/04/the-pendulum-swings-practical-measurement-in-ediscovery/; John Tredennick, Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize, Catalyst (Oct. 15, 2014), https://catalystsecure.com/blog/2014/10/measuring-recall-in-e-discovery-review-a-tougher-problem-than-you-might-realize-part-1/.
 Too bad for the people who became lawyers because they were told there would be no math.
 Michele C. S. Lange, TAR Protocol Rules the Roost: In Re Broiler Chicken, The ACEDS eDiscovery Voice (Feb. 8, 2018), http://www.aceds.org/blogpost/1653535/294460/TAR-Protocol-Rules-the-Roost-In-Re-Broiler-Chicken.
 Id. at *45, *47.
 Id. at *50–51.
 Roitblat et al., supra note 22, at 79.
 Fed. R. Civ. P. 1.
 See eDiscovery Institute Survey on Predictive Coding, eDiscovery Institute, ii, 3 (Oct. 21, 2010), https://www.discovia.com/wp-content/uploads/2012/07/2010_EDI_PredictiveCodingSurvey.pdf.
 See Grossman & Cormack, supra note 22 at 37; Roitblat et al., supra note 22 at 79.
 See Fed. R. Civ. P. 1.
 See eDiscovery Institute, supra note 53.
 See, e.g., Patrick Oot et al., Mandating Reasonableness in A Reasonable Inquiry, 87 Denv. U. L. Rev. 533, 548 (2010) (“[To review 1.6 million documents] [i]t took the attorneys four months, working sixteen hours per day seven days per week, for a total cost of $13,598,872.61 or about $8.50 per document.”).
 See Grossman & Cormack, supra note 5.
 See Roitblat et al., supra note 22, at 74.
 See Voorhees, supra note 33.
 See Grossman & Cormack, supra note 22, at 37; Roitblat et al., supra note 22, at 79.
 Grossman & Cormack, supra note 22, at 37.
 See, e.g., Hyles v. New York City, No. 10CIV3119ATAJP, 2016 WL 4077114 at *3 (S.D.N.Y. Aug. 1, 2016); Kleen Prod. LLC v. Packaging Corp. of Am., No. 10 C 5711, 2012 WL 4498465 at *5 (N.D. Ill. Sept. 28, 2012); Ford Motor Co. v. Edgewood Properties, Inc., 257 F.R.D. 418, 427 (D.N.J. 2009); see also The Sedona Conference, supra note 4, at 52.
 Hyles, 2016 WL 4077114 at *3.
Image Source: https://www.nrm.org/2016/06/john-henry-2009/.