Principal research scientist Kalyan Veeramachaneni has been working in data science for over a decade. Although his areas of research, machine learning and artificial intelligence (AI), are at the forefront of a new wave of interest in the field, his work remains driven by a fundamental question: how to improve the mechanics of data science.

Previously a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), Kalyan joined LIDS in 2016, one of several new ventures for him in recent years, including co-founding two startups, PatternEx and Feature Labs, and establishing his research group, the Data to AI Lab. The research group's work focuses on developing automation technologies that help data scientists parse the massive amounts of information produced by contemporary systems.

The lab's current threads of research center on improving the process of predictive modeling, a core technique of AI. A type of machine learning algorithm, predictive models use the patterns detected in existing data to derive, from new data, the likelihood of a specific outcome. This outcome can be defined as almost anything, so long as the dataset can somehow be used to predict it: the next show someone is likely to binge-watch on Netflix, for instance, or the location of a protein-coding region of DNA. The general process for building a predictive model (much simplified here) involves (1) problem selection: defining what the user would like to predict from their data, (2) feature engineering: processing raw data into a set of features, or variables, useful for making this prediction, and (3) machine learning: transforming the feature set into a predictive model. Each of these steps encompasses a range of complex, sometimes iterative, and sometimes human-driven substeps, all of which affect the functioning of the whole.
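To make the three steps concrete, here is a minimal sketch in Python using pandas and scikit-learn. The tiny table of customer visits, its column names, and the purchase-prediction problem are all invented for illustration; a real project involves far more data and iteration at every step.

```python
# A minimal, illustrative pipeline: problem selection, feature engineering,
# and machine learning. The data and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# (1) Problem selection: predict whether a customer visit ends in a purchase.
visits = pd.DataFrame({
    "visit_time": pd.to_datetime([
        "2015-08-17 10:30", "2015-08-22 14:05",
        "2015-08-23 09:45", "2015-08-24 19:20",
    ]),
    "items_browsed": [3, 12, 7, 1],
    "made_purchase": [0, 1, 1, 0],
})

# (2) Feature engineering: convert raw fields into variables useful for prediction.
features = pd.DataFrame({
    "day_of_week": visits["visit_time"].dt.dayofweek,
    "is_weekend": (visits["visit_time"].dt.dayofweek >= 5).astype(int),
    "items_browsed": visits["items_browsed"],
})
labels = visits["made_purchase"]

# (3) Machine learning: fit a model that maps the feature set to the outcome.
model = RandomForestClassifier(random_state=0).fit(features, labels)

# Use the model on a new visit: a Saturday with nine items browsed.
new_visit = pd.DataFrame({"day_of_week": [5], "is_weekend": [1], "items_browsed": [9]})
print(model.predict(new_visit))
```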

So, where does a data scientist begin when they are asked to build a predictive model? And what are the biggest bottlenecks they run into? Kalyan and his team set out to answer these questions by first taking the time to observe data scientists in action: "One of the things I realized in 2013 or 2014 was the numerous challenges that people face when they have actual real data, in terms of processing and deriving insights," says Kalyan. "And the only way for me to learn it firsthand was to actually try to solve a lot of problems." So for three or four years he worked on projects with data scientists in a range of industries, from healthcare to banking. This gave him a clear understanding of how predictive modeling was working (or not working) in the real world, and where automations would be most useful.

One critical discovery was that the process of feature engineering, which takes up the majority of a data scientist's time, could be automated. Feature engineering happens in two steps. The first step is called feature ideation. Here, a data scientist identifies the features, or variables, in their data that are most useful for solving the prediction problem. The second step is called feature extraction. This refers to the series of operations a data scientist runs on their raw data in order to organize it by the selected features, and then convert those features into a form that is ready for machine learning. To give an example, if a retailer is trying to figure out how to time a sale for maximum revenue, knowing whether more shoppers come to their store on weekdays or weekends would be of value. From this, the data scientist can identify 'day of the week' as a useful feature for prediction. If their database only contains specific timestamps, though, they would need to extract the feature, converting those timestamps to something more general, such as days of the week, in order to recognize a pattern. (In other words, for predictive purposes, knowing how many customers tend to shop at the store on Tuesdays is of much more use than knowing how many customers shopped there on August 17, 2015.)
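The retailer example can be sketched in a few lines of pandas. The timestamps below are invented; the point is simply how feature extraction turns specific timestamps into the more general 'day of week' feature.

```python
# Feature extraction sketch: convert raw purchase timestamps into a
# "day of week" feature and compare weekday vs. weekend traffic.
import pandas as pd

purchases = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-08-17 11:02",  # a Monday
        "2015-08-18 15:40",  # a Tuesday
        "2015-08-22 10:15",  # a Saturday
        "2015-08-22 16:55",  # a Saturday
        "2015-08-23 12:30",  # a Sunday
    ])
})

# Convert the specific timestamps into the general feature.
purchases["day_of_week"] = purchases["timestamp"].dt.day_name()
purchases["is_weekend"] = purchases["timestamp"].dt.dayofweek >= 5

# Aggregate: do more shoppers come in on weekdays or weekends?
print(purchases.groupby("is_weekend").size())
```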

Over the past two years, Kalyan has had major breakthroughs in automating the entire feature engineering process. The vital insight was recognizing that, although the disciplines and problem types are disparate, there are many operations common to all feature-engineering computations. Critically, Kalyan also found that he could abstract and generalize these operations. This gave the team what they needed to design feature-engineering algorithms that are both general enough and flexible enough to successfully analyze data across a range of problem types. Kalyan and his team then put this all together, creating an end-to-end system called the Data Science Machine (DSM). The DSM can automatically transform raw data into a predictive model, collapsing the building process down from months to hours. It does this using the group's feature engineering algorithms, followed by a machine-learning algorithm (the last step of the predictive modeling process). For good measure, the DSM also includes an algorithm that tweaks the model's parameters after each use, optimizing its predictive value over time. And all of this is done without the need for a data scientist's involvement.
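The sketch below is a toy illustration of that generalization, not the Data Science Machine itself: a handful of aggregation primitives (count, mean, max, sum) that can be applied automatically to any numeric column across a parent-child table relationship, producing candidate features without a human choosing them. The tables and column names are invented.

```python
# Toy illustration of generalized feature-engineering primitives applied
# automatically across related tables. Not the DSM; the data is hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.5, 12.0, 48.0, 5.0, 60.0],
})

# Generic primitives that make sense for any numeric column of any child table.
PRIMITIVES = ["count", "mean", "max", "sum"]

def synthesize_features(parent, child, key, numeric_col):
    """Apply every primitive to the child table, grouped by the parent key."""
    agg = child.groupby(key)[numeric_col].agg(PRIMITIVES)
    agg.columns = [f"{p}({numeric_col})" for p in PRIMITIVES]
    return parent.merge(agg, left_on=key, right_index=True, how="left")

feature_matrix = synthesize_features(customers, orders, "customer_id", "amount")
print(feature_matrix)
```

In the full system, candidate features like these feed the machine-learning step, and the resulting model's parameters are then tuned automatically.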

More recently, Kalyan has tackled the prerequisite for feature engineering: selecting the prediction problem. Shaped by the intrinsic challenges of working with raw data, it is typically a noisy, iterative process that can take months of collaboration between domain experts and data scientists to complete. Kalyan and his team, however, have found a way to flip the script, engineering a new approach that allows the data, in a sense, to speak for itself. As with feature engineering, the ability to formalize and abstract commonalities across domains is the cornerstone of the innovation. In this case, Kalyan developed a programming language called Trane to standardize how prediction problems are expressed. Then, using a series of algorithms the team designed to run in this common language, they have been able to generate a complete list of the prediction questions a given dataset is able to answer, increasing what can be done with a dataset by a thousandfold.
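To illustrate the idea (though not Trane's actual syntax, which is not shown here), a problem-enumeration step can be imagined as systematically combining a few building blocks, such as a column, an aggregation, and a prediction window, into candidate questions. The column names below are hypothetical.

```python
# Toy sketch of enumerating prediction problems from a dataset's columns.
# This only illustrates the concept; it is not the Trane language.
from itertools import product

numeric_columns = ["purchase_amount", "items_browsed"]   # hypothetical columns
aggregations = ["sum", "count", "max"]
windows = ["next 7 days", "next 30 days"]

problems = [
    f"For each customer, predict the {agg} of {col} over the {window}."
    for col, agg, window in product(numeric_columns, aggregations, windows)
]

for p in problems:
    print(p)
print(f"{len(problems)} candidate prediction problems generated.")
```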

A parallel focus of Kalyan's research is human-data interactions. Here, he and his team develop AI technologies for everyday users. "One of the challenges that I'm after is: How can we enable people to build predictive models for themselves, from their own data, without the need of a lot of expertise?" he says. "That's the question that keeps me up at night." The group's first foray provides people with the building blocks for an image-recognition app, a system they call "Build your own Deep Learner." The idea is that the app will be simple to make, requiring just a few clear steps, and will be an easy starting point for people to do more with their data, both individually and as part of a larger network. "A classic example would be identifying bird species," Kalyan explains. "For instance, a novice birdwatcher who encounters a bird they don't yet recognize can create an app and connect it to an online marketplace. Then the birdwatchers, the people who can actually identify the species, can download this app, take and tag a picture, and upload it to the cloud. An automated algorithm trains and tunes a deep learning model on the cloud, and anyone who is interested can download the model. It makes a sort of community around it, and we're building that community with AI."
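As a rough sketch of the kind of cloud-side training step described in the example, fine-tuning a pretrained image network on community-tagged photos might look something like the following. The directory layout, model choice, and training settings are assumptions made for illustration, not details of the actual system.

```python
# Hedged sketch: fine-tune a pretrained image classifier on tagged bird photos.
# Paths, image size, and model choice are assumptions for this example.
import tensorflow as tf

# Tagged photos uploaded by the community, organized one folder per species.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "tagged_bird_photos/", image_size=(224, 224), batch_size=32
)
num_classes = len(train_ds.class_names)

# Start from a network pretrained on ImageNet and train only a small head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# Publish the trained model so anyone in the community can download it.
model.save("bird_classifier.keras")
```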

LIDS has proven to be the right place for Kalyan to explore this research interest and many more, giving him the time and space to develop new ideas. "One of my favorite things about LIDS is the community. It has very highly intellectual people, very driven, and yet at the same time, they are very warm," he says. "I'm very appreciative of that." And as he thinks about future research directions in machine learning, Kalyan says he sees the next phase having a more human-focused approach: "The decisions, actions, or predictions made by the systems we engineer are going to be received by a human or impact a human in ways they have not before. So we need to think about how we build these systems, and how we assess their impact given this human factor. I think LIDS is an ideal community for that kind of work." And it's a place he's happy to be.