A 2011 report by Maura Grossman and Gordon Cormack showed technology-assisted review to be 50 times more efficient than manual review at that time. Adoption of e-discovery technology has grown since and continues to raise new issues and challenges for attorneys wishing to fulfill their obligations under the law.
Electronic discovery (commonly called “e-discovery”) refers to the exchange of electronically stored information (ESI) in civil litigation and government investigations.
Before computer use was widespread, discovery was paper-based and inherently self-limiting. Creating and storing paper (or microfilm) is costly in both funds and physical space. By comparison, the creation and storage of electronic data is virtually free. This has been true for some time, but the leap seems ever more drastic as computer technology improves.
The cost of computer processing power and data storage is declining rapidly. The ease of electronic information generation and storage has led to an enormous upswing in the amount of data any given institution maintains. That upswing has, in turn, caused huge growth in the cost and effort involved in e-discovery.
Happily, new software and services are being designed to help with e-discovery. Computers first created obstacles for efficient discovery by enabling the creation of massive amounts of electronic data. But now, their power may also make e-discovery manageable — by allowing legal professionals to offload ever more difficult tasks to machines.
Big Data Explosion
As mentioned above, the sheer quantity of ESI held by large businesses is growing exponentially. To all appearances, the cost of data creation and storage will continue to fall. As a result, companies will tend to over-save their data, especially if they are unsure of their current data retention obligations. It will be cheaper to err on the side of unnecessarily vigorous compliance than it would be to risk the wrath of courts, especially considering the minimal possible savings in storage costs.
Unfortunately, proper organization and maintenance of this massive new quantity of ESI is lacking. Therefore, a great deal of existing ESI is unstructured and has unknown characteristics. This type of data presents both the greatest costs and the greatest risks. The costs are high because any of the data that falls within the scope of discovery must be analyzed manually, with no assistance from computers. The risks are high because it is easy to fail to recognize data's relevance or privilege when wading through an unorganized mass.
The most expensive stage of e-discovery is document review. An organization may have to produce a wide host of documents for a case. Each one must be examined to determine whether it is relevant to the litigation/investigation and whether it contains privileged information. In many cases, up to three quarters of the expense of e-discovery goes toward document review.
Over the past decade, U.S. courts and regulators have made it very clear that e-discovery is not limited by borders. Companies large and small must be prepared to work at this international scale. Increasingly, even very small companies transact business in foreign countries or maintain an international supply chain.
Different jurisdictions have different standards for the secure storage and transmission of data. For international businesses, compliance with foreign privacy and data protection laws can present big challenges. Domestic companies need to exercise caution when moving data out of the European Union, for example. Many countries are restrictive in their scope of discovery when compared with the U.S. Globalization both limits e-discovery capabilities and expands the quantity of data that a given company needs to operate.
Predictive Coding (Technology-Assisted Review)
Software engineers have made great strides in teaching computers how to understand the content and the context of information. Google, for example, is constantly enhancing its search engine's ability to return relevant information to users. Years ago, the search engine was scarcely able to distinguish genuine content from keyword-stuffed nonsense. Now, its algorithms are sophisticated enough to learn a great deal about content from its similarities and differences to other content.
Likewise, learning algorithms allow document review software to determine the relative responsiveness of a document by understanding how it relates to other responsive and unresponsive documents. And like modern search engines, this new predictive coding software knows that relevance goes beyond the mere presence of keywords.
Searching for responsive documents with a traditional keyword search is very limiting. Depending on the software used, such a search will not typically distinguish between the various meanings of a single word. In such cases, all uses of the word will be interpreted as relevant, regardless of context. Such a search also neglects synonyms of search terms, possibly allowing responsive documents to go unnoticed.
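The two weaknesses described above can be seen in a minimal sketch of a keyword search. The three documents and the search term are invented for illustration, and the function is a simplified stand-in for what review software does, not a description of any particular product.

```python
import re

def keyword_search(documents, keyword):
    """Return the indices of documents containing the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [i for i, text in enumerate(documents) if pattern.search(text)]

# Hypothetical collection, invented for this example.
documents = [
    "Please shred the draft before the auditors arrive.",  # responsive, uses the keyword
    "The chef asked us to shred the cabbage for lunch.",   # not responsive, same keyword
    "Destroy the draft before the auditors arrive.",       # responsive, synonym only
]

hits = keyword_search(documents, "shred")
print(hits)
```

The search returns the first two documents: the cabbage message is a false positive because the word is matched regardless of context, and the third document is missed because it uses a synonym.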
Predictive coding software (also called technology-assisted review software) is programmed with a much broader set of language rules than keyword searches can utilize. Technology-assisted review software can understand words in context and search for concepts (as opposed to mere words and phrases). It can even recognize the emotional content of language, which helps reviewers search for evidence of wrongdoing.
In order to learn what makes a document relevant to a particular case, the software must first be trained. A fraction of all documents of potential relevance, selected at random, is reviewed by attorneys versed in both the case at hand and the software's operations. The system then analyzes each vetted document to determine the similarities and differences between relevant and irrelevant documents. It then forms a set of rules by which to determine the relevance of all other documents in the case.
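The training step can be sketched with a very small text classifier. The example below uses a naive Bayes model with add-one smoothing, which is one common learning technique chosen here for brevity; commercial predictive coding tools may use quite different methods. The seed documents, labels and function names are all hypothetical, and the sketch assumes the seed set contains at least one relevant and one irrelevant document.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(labeled_docs):
    """Build word counts per label from (text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labeled_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def relevance_score(model, text):
    """Estimate the probability that the text is relevant (0.0 to 1.0)."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    log_prob = {}
    for label in (True, False):
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for word in tokenize(text):
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        log_prob[label] = score
    # Convert the two log scores into a normalized probability.
    top = max(log_prob.values())
    p_relevant = math.exp(log_prob[True] - top)
    p_irrelevant = math.exp(log_prob[False] - top)
    return p_relevant / (p_relevant + p_irrelevant)

# Hypothetical seed set, as if coded by the reviewing attorneys.
seed_set = [
    ("Delete the pricing emails before the audit", True),
    ("We agreed on pricing with the competitor", True),
    ("Lunch menu for the team offsite", False),
    ("Reminder: parking garage closed Friday", False),
]

model = train(seed_set)
score_a = relevance_score(model, "pricing discussion with competitor")
score_b = relevance_score(model, "team lunch on Friday")
```

With this seed set, the first unseen document scores above 0.5 and the second scores below it. A seed set of four documents is far too small for real use, which is the point made later about minimum training volume.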
The system will score each document on its relevance to the rules. After the initial automated analysis is complete, a reviewing team will evaluate the quality of the software's analysis by manually reviewing a sample of documents across the range of scores. Documents with very high or very low scores should always be correctly designated by the software as relevant or irrelevant; some number of false negatives and false positives for marginal cases should be expected. If the performance does not pass muster, the reviewing team will make corrections to the software and re-run the analysis. The more documents that are manually reviewed and fed into the software as learning material, the better the automatic analysis will be.
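The quality check described above can be sketched as a stratified sample: documents are grouped into score bands, a few are drawn from each band for manual review, and the reviewers' decisions are compared with the software's. The scores, reviewer labels, band count and 0.5 cutoff below are assumptions made for illustration only.

```python
import random

def stratified_sample(scores, bands=4, per_band=2, seed=0):
    """Draw up to per_band document ids from each equal-width score band."""
    rng = random.Random(seed)
    buckets = {band: [] for band in range(bands)}
    for doc_id, score in scores.items():
        band = min(int(score * bands), bands - 1)  # a score of 1.0 falls in the top band
        buckets[band].append(doc_id)
    return {band: rng.sample(ids, min(per_band, len(ids))) for band, ids in buckets.items()}

def error_counts(doc_ids, scores, reviewer_labels, threshold=0.5):
    """Compare the software's call (score >= threshold) with the reviewers' call."""
    false_positives = false_negatives = 0
    for doc_id in doc_ids:
        predicted_relevant = scores[doc_id] >= threshold
        actually_relevant = reviewer_labels[doc_id]
        if predicted_relevant and not actually_relevant:
            false_positives += 1
        elif actually_relevant and not predicted_relevant:
            false_negatives += 1
    return {"false_positives": false_positives, "false_negatives": false_negatives}

# Hypothetical software scores and manual review decisions.
scores = {"d1": 0.05, "d2": 0.15, "d3": 0.35, "d4": 0.45,
          "d5": 0.55, "d6": 0.65, "d7": 0.85, "d8": 0.95}
reviewer_labels = {"d1": False, "d2": False, "d3": False, "d4": True,
                   "d5": False, "d6": True, "d7": True, "d8": True}

sample = stratified_sample(scores)
sampled_ids = [doc_id for ids in sample.values() for doc_id in ids]
errors = error_counts(sampled_ids, scores, reviewer_labels)
print(errors)
```

In this invented data the only disagreements fall in the two middle bands, which matches the expectation that marginal scores produce the errors while the extremes are designated correctly.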
Once satisfied that the computer analysis is reliable, the reviewing team will review every document beginning with those confidently deemed relevant. The review will continue down the range of relevance scores until counsel are satisfied that the risk of false negatives is outweighed by the cost of additional manual review.
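One way to picture this stopping decision is as a simple cost comparison. The sketch below treats each score as the probability that a document is relevant and stops when the expected cost of missing the next document drops below the cost of reviewing it. The rule, the cost figures and the function name are hypothetical; in practice counsel weigh many factors that a single formula does not capture.

```python
def plan_review(scores, review_cost, miss_cost):
    """Return document ids to review manually, highest relevance score first.

    Stops at the first document whose expected miss cost (score * miss_cost)
    is lower than the cost of reviewing it.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    queue = []
    for doc_id, score in ranked:
        if score * miss_cost < review_cost:
            break
        queue.append(doc_id)
    return queue

# Hypothetical scores and per-document costs in arbitrary units.
scores = {"a": 0.9, "b": 0.6, "c": 0.2, "d": 0.05}
queue = plan_review(scores, review_cost=2.0, miss_cost=20.0)
print(queue)
```

Here the first three documents are queued for manual review, and the lowest-scored document is left out because reviewing it would cost more than the expected harm of missing it.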
Predictive coding has come a long way, but the technology is still in its infancy and has some limitations. It is viable only in large cases, those involving documents numbering in the tens of thousands, for two reasons. First, the minimum fraction of all documents that must be manually reviewed and taught to the software is small, but the minimum number of documents is not. Computers cannot adequately learn what constitutes relevance from a sample set of mere dozens of documents. Second, the savings in attorney-hours will only outweigh the cost of the software or service if a large number of documents is automatically analyzed. A further limitation is that current predictive coding is only useful for text; it does not work with diagrams, pictures, audio or video. Computers will understand these types of information as well as they do text someday, but that day is not close at hand.
Predictive coding is fairly new technology, and it is not yet fully accepted by litigants and courts. Doubts remain as to the defensibility of relying on software to determine documents' significance. As the technology improves, accuracy rates and savings will continue to climb, and wider acceptance will surely follow.
In the meantime, predictive coding is particularly useful in internal investigations. Companies that gain experience with the technology in this capacity will be better able to judge its applicability to their own litigation and regulatory investigations. Moreover, companies facing a massive increase in stored data because of international endeavors can use the current technology to keep their varied and expansive documentation under control.
Data extraction technology is closely related to predictive coding; both use sets of rules to understand the context of information. This software piggybacks on the sophisticated rule sets of predictive coding to extract and return certain data points from documents to the user.
Data extraction software might contain rules to recognize various types of documents. For instance, it could learn to recognize letters by the presence of addresses, salutations and signatures. From there, the software can be programmed to discover who sent the letter and who received it.
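A minimal sketch of such rules follows. It recognizes a letter by a salutation line and a closing line and then extracts the recipient and sender. The patterns, the sample text and the function name are assumptions made for illustration; real extraction software would rely on far richer rule sets, and this sketch does not attempt to check for addresses.

```python
import re

# A salutation such as "Dear Ms. Rivera," at the start of a line.
SALUTATION = re.compile(r"^Dear\s+(?P<recipient>[^,:\n]+)[,:]", re.MULTILINE)

# A closing such as "Sincerely," followed by the sender's name on the next line.
SIGNATURE = re.compile(
    r"^(?:Sincerely|Regards|Best regards|Yours truly),?[ \t]*\n+[ \t]*(?P<sender>[^\n]+)",
    re.MULTILINE | re.IGNORECASE,
)

def extract_letter_parties(text):
    """Return the sender and recipient if the text looks like a letter, else None."""
    salutation = SALUTATION.search(text)
    signature = SIGNATURE.search(text)
    if not (salutation and signature):
        return None
    return {
        "recipient": salutation.group("recipient").strip(),
        "sender": signature.group("sender").strip(),
    }

# Hypothetical documents, invented for this example.
letter = (
    "123 Main Street\nSpringfield\n\n"
    "Dear Ms. Rivera,\n\n"
    "Please find the signed contract enclosed.\n\n"
    "Sincerely,\nJordan Lee\n"
)
memo = "Team, the meeting moved to 3 pm."

parties = extract_letter_parties(letter)
print(parties)
print(extract_letter_parties(memo))
```

The letter yields a recipient and a sender, while the memo is not recognized as a letter because it has neither a salutation nor a closing.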
Information governance (IG) is an emerging discipline concerning the management of information at an enterprise level. It fulfills regulatory, legal, risk and operational requirements for a company's electronic data. Information governance tackles the dominant problems of e-discovery — volume of data and lack of structure — at their cores.
Information governance is closely related to records management, but it also encompasses several important, additional aspects. Traditional records management deals only with creation, retention, storage and disposition. IG adds privacy, e-discovery requirements, access controls, storage optimization and metadata management.
Traditionally, records management personnel could work in relative isolation within a discrete department of a company. Information governance, on the other hand, is a large, all-encompassing business initiative. An IG authority within a company must pull together the work of IT, corporate counsel, litigation counsel, compliance and other departments. As such, large enterprises often create IG committees.
Metadata — data about data — is an aspect of IG that enhances the capabilities of predictive coding technology. Predictive coding is designed to help computers understand documents through rules about human language. Metadata is tailored to computer information formatting. It is not nuanced or open to interpretation like language is. By combining metadata with predictive coding, a computer can find a great deal of context in which to better understand documents.
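As a simple illustration of combining the two, the sketch below first narrows a collection using unambiguous metadata fields and then orders what remains by a text-based relevance score. The field names (custodian, sent, score) and the sample records are hypothetical, and this is only one of many ways metadata and predictive coding could be combined.

```python
from datetime import date

def filter_then_rank(docs, custodians, start, end):
    """Keep documents matching the metadata criteria, then sort by relevance score."""
    in_scope = [
        doc for doc in docs
        if doc["custodian"] in custodians and start <= doc["sent"] <= end
    ]
    return sorted(in_scope, key=lambda doc: doc["score"], reverse=True)

# Hypothetical records: metadata fields plus a score from a text classifier.
docs = [
    {"id": "m1", "custodian": "cfo", "sent": date(2020, 3, 1), "score": 0.4},
    {"id": "m2", "custodian": "cfo", "sent": date(2020, 3, 15), "score": 0.9},
    {"id": "m3", "custodian": "intern", "sent": date(2020, 3, 10), "score": 0.95},
    {"id": "m4", "custodian": "cfo", "sent": date(2021, 1, 5), "score": 0.8},
]

ranked = filter_then_rank(docs, {"cfo"}, date(2020, 1, 1), date(2020, 12, 31))
ranked_ids = [doc["id"] for doc in ranked]
print(ranked_ids)
```

The metadata test is exact (a custodian either matches or does not), so it removes two documents outright before the less certain text score is used to order the rest.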
Information governance will be key to controlling the cost of e-discovery in the long term. It can take years to fully develop and implement an IG strategy, but the effort can be divided into small projects. Each project also provides value in the short term, reducing litigation costs and spending on data storage.
In short, information governance is records management for the 21st century. The term itself may not stick, but it should be self-evident that efficient, thorough and effective management of electronic data will be necessary for years to come.
In-House E-Discovery and Service Providers
Larger corporations may choose to bring e-discovery completely in-house, investing in software, computer servers, installation, staff (both client-facing staff and additional IT staff) and training. A comprehensive information governance strategy is a prerequisite to a cost-efficient, in-house e-discovery system.
Several important trends promise to expand the ranks of companies capable of handling e-discovery in-house. Chief among these are predictive coding, information governance, falling computer hardware prices, increased competition in e-discovery vendor markets and wider availability of e-discovery-proficient legal and IT staff. For the near future, however, the rise of these trends lags far behind the big data explosion that first sent the cost of discovery skyward.
Therefore, most companies whose e-discovery budgets are neither very small nor very large will likely need to hire e-discovery service providers. A “managed services” approach involves the outsourcing of day-to-day management of e-discovery tasks, reserving legal staff resources for case strategy and substance. A “shared services” model outsources some e-discovery tasks and keeps others in-house. For instance, a company might handle preservation, collection and early case assessment while outsourcing certain tasks within the review or the review itself.
No matter the approach, companies must bear in mind that home-grown e-discovery is likely to attract greater scrutiny by courts and regulators than work done by outside counsel.
The cost of e-discovery is spiraling out of control. Technology — namely, predictive coding combined with information governance — is beginning to create solutions. The rate of data creation, of litigation and of government investigation is not going to decline anytime soon. Companies must adapt and plan for long-term solutions as well as for the needs of each specific case.
Predictive coding has limits to application and practicality. Its utility in an average case may be questionable for now. But companies should not remain ignorant of the technology's potential and the current state of the art.
Similarly, implementing information governance is a costly and complex undertaking. Firms should roll it out over the long term and only after a thorough cost-benefit analysis. But companies should not behave as if they can succeed in the 21st century without comprehensive data stewardship.
These technologies and those still to come will eventually temper the inflation of e-discovery costs created by the rise in data creation, litigation and regulation.