Case Study: Sparkgram

Text mining combined with machine learning techniques is a powerful tool for gaining insight into the development and impact of language patterns. A well-known example is the Google Books project's analysis of its digitized book corpus, which revealed a wealth of information about the prominence of topics in writing across several languages over the past 200 years.

In D-GESS at ETH, Daniel Chen’s group similarly works on uncovering subtle and sometimes hidden biases in legal texts. SIS has been working with the Chen group to help them process a particularly large corpus containing 120 years of U.S. Circuit Court opinions. To facilitate the analysis, we developed a text-analysis package, Sparkgram, built on the distributed “Big Data” computing framework Spark. Distributing the work with Spark significantly reduces the time previously required to process the entire corpus and construct a so-called “bag-of-words” representation of each document, a necessary step before applying sophisticated learning algorithms. The data analysis is carried out on the ETH HPC cluster Euler.
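
To make the “bag-of-words” idea concrete, here is a minimal sketch in plain Python. It is not Sparkgram's actual code or API: the function names (`tokenize`, `ngrams`, `bag_of_words`), the simple lowercase-letters tokenizer, the choice to count bigrams alongside single words, and the two toy documents are all assumptions made for illustration. Each document is reduced to a table of term counts, and the per-document tables are then summed into corpus-wide counts.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters only (an assumption,
# not necessarily how Sparkgram tokenizes court opinions).
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=2):
    """Count every 1-gram up to max_n-gram in a single document."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Toy documents, invented for this example.
docs = {
    "doc_a": "The court affirmed. The court reversed.",
    "doc_b": "The appeal was dismissed.",
}

# "Map" step: one independent bag of words per document.
bags = {doc_id: bag_of_words(text) for doc_id, text in docs.items()}

# "Reduce" step: merge the per-document counts into corpus-wide counts.
corpus_counts = sum(bags.values(), Counter())

print(bags["doc_a"].most_common(3))
print(corpus_counts["the"])
```

Because each document is processed independently, the per-document step is the kind of work a framework such as Spark can spread across many cluster nodes, for example by applying a function like `bag_of_words` to every element of a distributed collection, with only the final merge requiring data from different documents to be combined.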