Dr. Kirk Borne is currently the Principal Data Scientist and Executive Advisor at Booz Allen Hamilton in Virginia. Prior to that, Dr. Borne was an astrophysicist and professor, and he has long been a major influencer in the world of Big Data. At his talks, Dr. Borne frequently speaks out on ways to use different areas of data science, including AI and machine learning, to improve society.
In our wide-ranging interview, Dr. Borne discusses how data science can help all kinds of industries — from Fortune 500 companies trying to make use of a massive database to NASA’s discoveries of hundreds of previously unknown galaxies and exoplanets.
Thank you so much for joining us, Kirk. A lot of our students and readers are eager to hear about your data science experience and listen to the advice you can offer. But, let’s start with the basics. What’s your background?
My background is astrophysics. I studied physics in undergrad, then I got a PhD in Astronomy and Astrophysics at Caltech. That was quite a number of years ago — over 35 years ago. I spent the first 18 years of my career working with data systems for astronomy projects at NASA.
I would tell people that my day job was data, and my night job was data. That's just a saying as an astronomer. For my own research, I was obviously working with data, but also much of the work I did at NASA involved working with large databases and catalogues of astronomical objects.
And what did you do with all of this data?
At one point, about 20 years ago, one of the data sets that we were making available to the research community for one of these projects was over two terabytes in size. Back then, two terabytes was an enormous amount of data. We bought a terabyte storage device, and it cost 60,000 dollars. It was unheard of for anyone to even want that much data. But nowadays, I'm sure most of us have a terabyte of storage on our laptops, so that's no big deal.
To start, I didn't know what anyone could do with so much data, and a colleague of mine recommended looking into data mining. I didn't know what that meant, but it sounded interesting.
Data. Mining. Two words I'd never heard together before. I started exploring the wonderful world of machine learning algorithms and what we would now call data science, which is all these techniques for doing inference and model-building from data.
That probably opened up a whole new world, right?
Definitely. I realized that mostly what I had been doing in my astronomy world was more data analysis — not really looking for the patterns, trends, and hidden knowledge that's embedded in those data.
I got hooked on data science because I found it so fascinating, not only because of my mathematical background or how interesting the algorithms were, but because it was just a cool way of discovering patterns and trends in data. It was useful everywhere, which also made the allure of data science so strong for me. This wasn't just happening in my field and it wasn't happening just in the sciences. It was happening everywhere.
We’re always surprised by how many different fields leverage data science. Where did you see it sprouting up early on?
In national security, in health care, and before too long, it was on the social networks — but don’t forget simple web analytics and e-commerce. I remember reading about the first recommender engine on Amazon, back when Amazon first came into existence, and they had this recommender engine recommending products to people using these algorithms.
Of course, we all love recommender engines now. Our shopping experience, our movie rental experience, or whatever else we do has some flavor of that.
I got on this bandwagon at NASA to say, "Hey, we should have a recommender engine for our data system, since scientists come to our website to find astronomical or space science data related to their research projects. Why don't we just say, 'Hey, the people who looked at these data also looked at these.'" All of this moved me more and more towards data science as my vocation, as opposed to astrophysics… though I've never given up my interest there, or some research projects, especially with my students.
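As a rough illustration of the "people who looked at these data also looked at these" idea, here is a minimal Python sketch that counts which datasets appear together in the same browsing session. The sessions, the dataset names, and the helper functions `co_view_counts` and `also_viewed` are hypothetical. They are not taken from any NASA system, and real recommender engines are considerably more elaborate.

```python
from collections import Counter
from itertools import permutations


def co_view_counts(sessions):
    """Count, for every dataset, how often each other dataset shared a session with it."""
    counts = {}
    for viewed in sessions:
        # set() so that a dataset viewed twice in one session is counted once
        for a, b in permutations(set(viewed), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_viewed(counts, item, top_n=2):
    """Return up to top_n datasets most often viewed alongside `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]


# Hypothetical browsing sessions (dataset names are made up for illustration)
sessions = [
    ["galaxy_survey", "quasar_catalog", "star_catalog"],
    ["galaxy_survey", "quasar_catalog"],
    ["galaxy_survey", "star_catalog"],
    ["quasar_catalog", "galaxy_survey"],
]

counts = co_view_counts(sessions)
recommendations = also_viewed(counts, "galaxy_survey")
print(recommendations)
```

Simple co-occurrence counting is only one possible approach; it is used here because it maps directly onto the sentence quoted above.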
Speaking of your students, can you tell us a bit about how you ended up in academia?
After 18 years at NASA, I decided I really wanted to teach data science and create data science programs for the next generation, because data is eating the world.
At that point, 15 years ago — wow, time flies — I left NASA and went to George Mason University, and we started the world's first undergraduate data science degree program… as a professor of astrophysics, I never taught astrophysics. I taught data science.
I did that for 12 years and thought I would stay there as a tenured professor — teaching at a university had been a lifelong dream of mine. I didn't think I would ever leave such a position. In science, having tenure at a university is not something you give up lightly.
Then, Booz Allen Hamilton called me four years ago and made me an offer. Three years ago, I made the big switch to leave the university and become the principal data scientist for this management consulting firm.
Many people don’t leave academia. What has the change been like?
A lot of my colleagues in the academic world thought I was kind of nuts to become a management consultant. So I'd say, "Well it's not really any different than what I've always done." It's three words to me, “data to action." Data to action is what we always did in our science. We collect data, then we decide what to do. We write our research paper, we do another experiment, or we find some new colleagues to help us understand it, or we write a new proposal. It looks different, but it's not really all that different.
Now that I get to do all this cool data science stuff across many, many different industries, different organizations, different sectors — healthcare, cybersecurity, national defense — I get to have conversations about the power of data in a lot of different places.
A few of our students are curious about how you used the power of data specifically in astronomy. Did you find any insights using machine learning techniques, especially unsupervised learning methods, from astronomical data?
There's a satellite in space called Kepler, which we use to discover exoplanets around other stars. The technique they're using is a very traditional data analysis technique, but to tease out the really faint signals in the data, people have used machine learning techniques to basically infer the existence of other planets around some of these stars.
In the first discoveries from that mission, they would find one or two planets at a time, but now they’re using that machine learning to find literally entire solar systems with many planets around these stars, because they're very complex signals when you have many, many planets in there.
On a cosmology scale, using computer vision techniques with deep learning or even just basic neural networks, researchers are finding a specific type of image of galaxies caused by the gravitational effect of the dark matter in our universe. So we’re actually measuring on the largest possible scale, the scale of the universe. What's the distribution of mass, and what's the dark matter content, which tells us something about how the universe began and came to the state it is today.
Can you explain how you make those inferences?
So on that scale, we’re only trying to find weak distortions in galaxy images, which are caused by this dark matter, which we don't see, but know it's there because it causes gravitational effects. That comes via our good friend Albert Einstein, who proved that gravity can cause the path of light to be affected, just in the same way when a planet orbits the sun, it's being pulled around by gravity. So light can be pulled around by gravity too — it’s called gravitational lensing. It's a very, very small effect, but on a very large scale, you can see it. To find it on a universal scale, it takes machine learning and computer vision.
Local stars and planets on one end and the size of the universe on the other are just two examples. There are all kinds of things in between. For me, the one area which I always loved the most in data science applications, not just in astronomy but in any field, is what people would call anomaly detection, or outlier detection. When I worked with my students, we always called it surprise discovery. Finding the surprising, unexpected thing in your data.
Meaning an unexpected outlier?
It could be something as simple as an outlier, but I prefer to think the most interesting things in data are the inliers. That is, something which looks, statistically, like it's normal, but there's something very different about the object, and that sort of difference, or surprising-ness, has to do with where that object is located in the multi-dimensional parameter space.
So I'm going to give you an example that's not astronomy.
I once gave this talk at a very large oil exploration company. My company was contracted to help them develop their internal Analytics Solution Group.
They had a launch event for the group, where I was the keynote speaker. We talked about a bunch of data science techniques: Pattern discovery, trend discovery, class discovery, clustering, and so forth. When I got to surprise discovery, I talked about inliers and outliers.
I said, "So what if I get a hold of some of your data, and I extract three columns of information. I create a database with three columns of information about all your customers, all of your suppliers, all of your vendors — this whole database of the organizations and people. And I have these three columns, and I make a plot of X, Y and Z. So I won't say what the columns represent, but I'll make the plot of X, Y and Z and look statistically at each column of data."
"Suppose there's a row in that database that has the value of X right near the mean of X values, and has a value of Y for that row which is right near the mean of the Y distribution, and the value of Z is near the mean of the Z distribution for the entire database."
So every statistics book on the planet would basically say that's a normal data point. It's not an outlier. There's nothing special about it, because it's certainly within one standard deviation of the mean of the distribution in every parameter that I have, X, Y and Z.
So what's the big deal? Why would you call that thing out?
I sort of cheated. I said, "The X, Y and Z here are actually physical X, Y and Z coordinates of the location of your customer or your vendor on planet Earth. Except we don't use X, Y and Z coordinates. We actually use latitude and longitude. So for something to be at X=0, Y=0 and Z=0 approximately, that customer would have to be at the center of the Earth."
Oops. That's obviously a flaw, which reveals something crucial. You really truly can have a point that looks like it's perfectly normal from a statistics definition of outlier, that is near the midpoint of the distribution of all those coordinates, and yet it's completely bogus. There's something completely, seriously wrong there.
Another thing that's interesting that's revealed is that sometimes an outlier is a data quality problem, because obviously you don't have any customers at the center of the Earth. So an outlier sometimes can be a data quality problem, but sometimes it really can be an operational, real measurement, indicating some anomaly in a machine, or anomaly in the product, if it has an outlier value like that.
So, in other words, the most normal-looking thing in the world might be profoundly abnormal — that’s a big point for aspiring data scientists to understand.
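The inlier example above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the analysis from the talk: the customer locations are synthetic and spread uniformly over a spherical Earth (real customers would not be), the radius constant and the one-kilometer tolerance are arbitrary choices, and `latlon_to_xyz` is a hypothetical helper.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a perfectly spherical Earth


def latlon_to_xyz(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert latitude/longitude in degrees to Cartesian X, Y, Z in km."""
    lat = np.radians(lat_deg)
    lon = np.radians(lon_deg)
    x = radius * np.cos(lat) * np.cos(lon)
    y = radius * np.cos(lat) * np.sin(lon)
    z = radius * np.sin(lat)
    return np.column_stack([x, y, z])


# Synthetic "customers" spread uniformly over the globe (an assumption)
rng = np.random.default_rng(0)
n = 1000
lon = rng.uniform(-180.0, 180.0, n)
lat = np.degrees(np.arcsin(rng.uniform(-1.0, 1.0, n)))
customers = latlon_to_xyz(lat, lon)

# One bogus record at X=0, Y=0, Z=0: the center of the Earth
data = np.vstack([customers, np.zeros((1, 3))])

# Column-by-column test: flag rows more than one standard deviation from the mean
z_scores = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
per_column_outlier = (z_scores > 1.0).any(axis=1)

# Joint test: every valid location must sit on the surface of the sphere
distance_from_center = np.linalg.norm(data, axis=1)
off_surface = np.abs(distance_from_center - EARTH_RADIUS_KM) > 1.0

print("bogus row flagged column by column:", per_column_outlier[-1])
print("bogus row flagged by the joint check:", off_surface[-1])
```

The column-by-column check treats the bogus row as ordinary because it sits near the mean of each coordinate, while the joint check, which looks at all three columns together, singles it out.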
Do you have any places you’d recommend that aspiring data scientists visit online, where they might be able to pick up on insights like that?
Well, the world is inundated not only with new things, but also with people reporting new things. There's literally a firehose of sources nowadays. What I do notice is that among the sources I look at, and the people I follow on Twitter, there tends to be a lot of duplication. There's a lot of replication, because after all, it's just the same world.
Anyway, I feel like there are places one could go, and again, I don't want to give too much weight to any specific place where one can go and read articles, but I'll just shout out a few of my favorites. There’s KDnuggets.com, which was started as an email feed for data mining news by the great Gregory Piatetsky-Shapiro about 30 years ago.
Another place that I go to all the time is Data Science Central, a community with around 50,000 members. Anyone can post blogs and articles, announcements, or questions for other people to answer, so there's new content every day.
I can also point to my own Twitter feed, because I try to Tweet out all the interesting things that I read. Sometimes, the best way to engage is to just start your own blog series. Just by doing that, it helps you to learn something, when you have to try to explain it to someone else. That's what I always tell people.
There's also the whole academic side of this. arXiv.org is an academic research paper repository where people post their research papers every day, in many, many different fields, including computer science, machine learning, databases and a million other topic areas, so check that out. But remember, they’re academic papers.
Of course, there’s always Google. I go to Google News, and I search 'machine learning' or 'data science' or 'big data science' or 'big data machine learning'. Some combination. 'AI'. Funny enough, it's curated by Google’s own tools. They sort of cluster the results. It's a useful application of data science.
I always tell people: don't underestimate the power of data science to help you do data science. Google News is an organizer of information. It organizes the information that you need. Under the covers, there are machine learning algorithms that they're using to present information to you.
Are there any other places that aspiring data scientists can use data science to learn data science?
Have you ever used Yippy?
No, what’s Yippy?
So if you go to Yippy.com, it looks like Google. That is, it just has the search box and the name. If you type in the word 'mining' and search, it gives you a list of top results like any search engine would, but off to the side, it also gives you a topic map of everything related to mining. So data mining, crypto mining, oil and gas mining, mining equipment, gold mining, gem mining, history of mining, mining jobs, mining journal.
Another good example I always show to my students is bonds. If you type the word 'bonds', you find about stocks and bonds, about bond investing, or about Barry Bonds, football– I mean baseball player. You also learn about James Bond. And you also find the Bonds underwear company, which is the number one men’s clothing retailer in Australia. So when you type bonds, which one of those are you looking for?
This is cluster analysis, or topic modeling, of the search results. It presents to you not only the top search results, which come from their own algorithm that determines relevance, but also possibly related searches that you might have really been looking for.
It’s yet another application of data science and machine learning helping us organize search and manage data in this information overload world.
That’s definitely useful. It’s like the “Did you mean” feature, only you get it on every search, plus a complete visualization.
Let’s transition again to some questions from our students. One student wanted to know, what do you find most inspiring with programming languages such as Python?
Well that's an excellent nerdy question. When I started programming decades ago, in high school, we were able to access the local university mainframe through a Teletype modem. It was ridiculously slow and ridiculously clunky. I was running the Fortran language, and we started by just doing very simple programming.
Today, with Python, it's really one of these wonderful languages where you can do extremely simple things, or you can do very complex things. I would say that the power when you start getting into it comes from interesting data structures, such as dataframes and dictionaries, and it's those data structures that allow you to store complex data types and then do very simple manipulations.
I mean, you could just go A times B to multiply two matrices. Of course, that wasn't the case when I was learning programming. You didn't have those objects that you could manipulate like that. So having the object level manipulation is powerful, and at the same time, doing very simple things.
So you can look at Python code sometimes and see people calling from these enormous libraries that other programmers have created, which lets you run significant calculations in a very small number of steps. Whereas, I think back to my days when I was running Fortran and you had to literally write out every array variable with its index, and then loop through the indexes. Oh my gosh, no one does loops anymore.
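To make the contrast with explicit index loops concrete, here is a small Python sketch using NumPy. It assumes the matrices are NumPy arrays, which the interview does not specify, and `matmul_loops` is a hypothetical helper written only for comparison. One detail worth noting under that assumption: for NumPy arrays the `*` operator multiplies element by element, and the matrix product is written with `@`.

```python
import numpy as np


def matmul_loops(a, b):
    """Multiply two matrices the old way: one explicit loop per index."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
    return out


A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

product = A @ B      # matrix product in a single expression
elementwise = A * B  # element-by-element product, not a matrix product

# The loop version gives the same answer as the one-line matrix product
assert np.allclose(matmul_loops(A.tolist(), B.tolist()), product)
print(product)
```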
Another student wanted to know, what advice do you have for aspiring data scientists who have a background in programming, but not in math?
Well, speaking of Python, it has a lot of ways to help you out. There's Pandas, NumPy, and SciPy. These libraries have function calls that will do the math, as long as you know what algorithm you need, whether you're doing clustering, or classification, or whatever. You still need to know what type of algorithm or what type of function call you need, but it doesn't necessarily mean you need to know the full mathematical machinery behind the algorithm.
Support vector machine is a good example. If you actually look into the literature around support vector machine from the mathematics side, I mean, there are entire books on it. The average person is not going to be able to get through those books. I mean, it just isn't going to happen. It's very intense.
What I did when I was learning was to write my own versions of these libraries from scratch — yes, they existed in Fortran, too. So, I wrote my own plotting programs from scratch, because I wanted to know how it worked. If you're crazy enough like me to want to do that, you can go do that, but you don't have to do that.
I don't think you can get really far in data science without some good mathematical knowledge of what's going on under the covers there, but I don't think it's necessary to get too far into the mathematics of it. I mean, it's still valuable to know the limitations of algorithms and things like that, but I think it's not totally essential to get started.
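As one illustration of letting a library call do the math, the sketch below clusters synthetic two-dimensional points with SciPy's hierarchical clustering functions `linkage` and `fcluster`. The data, the choice of Ward linkage, and the request for exactly two clusters are assumptions made for this example; the interview does not name a specific function or algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: two well-separated blobs of 50 points each (an assumption)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Two library calls do the work: build the cluster tree, then cut it into two groups
tree = linkage(points, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")

print("cluster labels found:", sorted(set(labels.tolist())))
```

The caller still has to decide that clustering is the right tool and how many groups to ask for, which matches the point above about knowing which type of algorithm you need.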
One last question, because this was something that really resonated with us — you mentioned at the Future of Technology Summit that one of your biggest pieces of advice is that data scientists should talk more about their failures. Can you explain what you meant by that and why you think it’s important?
There's a professional dimension of that, and there's also a mathematical dimension. Let me start from the mathematical meaning that I had. If you're familiar with advanced tools like TensorFlow or earlier renditions of that, which are basically neural networks trained with backpropagation, or for that matter any data science or machine learning project, you're trying to minimize error and maximize accuracy.
You build a model and you measure its accuracy, its precision, measure its error, and you don't stop there. You build another model and change the features that you use, or change the weights or something, and change the type of algorithm or something, and you see did the error improve or not? Did the accuracy improve or not?
Presumably it did, if you really changed something of significance in the model. Based upon which direction the error went (did it get worse? did it get better?), you move in the direction of reduced error. We call that gradient descent.
Gradient descent follows that error curve to its minimum — of course, you don't know where the minimum is until you hit the bottom and then come back up the other side of the valley, so you go until your error starts getting worse again. In order to know which direction to move, you need to have at least two failed models. Two poor models.
Both failures are necessary for you to get to success.
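The loop described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not a description of how TensorFlow works internally: the model is a single slope `w` in `y = w * x`, the data are made up, and the function names, step size, and stopping tolerance are hypothetical choices. At every step, two imperfect trial models on either side of the current one are compared to decide which direction reduces the error.

```python
import numpy as np


def mse(w, x, y):
    """Mean squared error of the one-parameter model y = w * x."""
    return float(np.mean((w * x - y) ** 2))


def descend(x, y, w=0.0, step=0.1, probe=1e-3, max_iter=1000, tol=1e-12):
    """Follow the error downhill until it stops improving."""
    history = [mse(w, x, y)]
    for _ in range(max_iter):
        # Two trial models, one on each side of the current one, give the slope
        slope = (mse(w + probe, x, y) - mse(w - probe, x, y)) / (2 * probe)
        candidate = w - step * slope  # move in the direction of reduced error
        error = mse(candidate, x, y)
        if error >= history[-1] - tol:
            break  # no meaningful improvement left: we are at the bottom of the valley
        w = candidate
        history.append(error)
    return w, history


# Made-up data generated with a true slope of 3.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w, history = descend(x, y)
print(f"fitted slope: {w:.6f} after {len(history) - 1} improving steps")
```

Every entry in `history` except the last belongs to a model that was later beaten, which is the sense in which the earlier, poorer models are needed to reach the better one.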
Then, there's the professional dimension of failure reporting. Any good scientist knows that you don't just get the answer and that's the end of your research. Scientists have careers. You don't just do one paper and then they're done. You learn from the research that you do. You learn from the stuff that you do how to do better, how to improve your model, how to improve your inference of the things you're modeling, whether it's the universe in my case, or the human body in a medical research world, or understanding the world if you're a geoscientist, or whatever your domain is. You do things, experiments, model-building, whatever, learn from it, and improve.
Being able to talk about how you learned from prior models, or from prior experiments is a vital, critical piece of being a scientist. You just can't say, "Oh, everything I've ever done is perfect." First of all, people won't believe you. Think about Thomas Edison. People say, "Gee you've failed 999 times." To use the metaphor, he didn't discover the functional light bulb until his 1000th attempt. Of course, I'm sure it wasn't exactly 1000, but anyway. But the story goes — so they said, "So you failed 999 times." And he said, "No, I did not fail. I learned 999 ways not to make a lightbulb."
To be able to talk about that for a job interview, I think the critical thing is to have some sort of portfolio. I mean, maybe it could be a physical portfolio that you carry in there, or it could be a GitHub portfolio that you could point someone to, or just a set of projects in your head that you talk about. Having that portfolio is more important sometimes than the academic credentials right now in the data science world, because the demand for the talent is so high.
In an interview, don’t show me only examples of the perfect model, the perfect solution, the perfect data set — that's not the way it really is. Tell me about projects where it failed. What did you learn from that? Why did it fail? What did you do to try to change the outcome, or change the model and improve?
I think the more you can talk about that failure in a professional way, talk about whether the problem was with the data, or with the algorithm, or with the modeling approach, or with the code, being able to talk about those things reveals a lot about the depth of your knowledge, as well as the depth of your maturity in the field.
So, I'm not encouraging failure for failure's sake. I would say that what we really seek is tactical failure (from which we learn and recover), not strategic failure. I mean, we're not planning to fail, we're learning from it; it’s failure for learning's sake. And I go back to the machine learning case where you can't really know if you have the best model until you've built at least two to show that one is better than the other, and if it is, then you shouldn't stop there.
That's a great note to end on, and I think that that advice will certainly resonate with our students. We really appreciate you taking the time to chat and can’t wait to hear what the students and readers say about all of your advice.