010 – Carl Hoffman (CEO, Basis Technology) on text analytics, NLP, entity resolution, and why exact match search is stupid

Experiencing Data with Brian O'Neill (Designing for Analytics)

My guest today is Carl Hoffman, the CEO of Basis Technology, and a specialist in text analytics. Carl founded Basis Technology in 1995, and in 1999, the company shipped its first products for website internationalization, enabling Lycos and Google to become the first search engines capable of cataloging the web in both Asian and European languages. In 2003, the company shipped its first Arabic analyzer and began development of a comprehensive text analytics platform. Today, Basis Technology is recognized as the leading provider of components for information retrieval, entity extraction, and entity resolution in many languages. Carl has been directly involved with the company’s activities in support of U.S. national security missions and works closely with analysts in the U.S. intelligence community.

Many of you work all day in the world of analytics: numbers, charts, metrics, data visualization, etc. But, today we’re going to talk about one of the other ingredients in designing good data products: text! As an amateur polyglot myself (I speak decent Portuguese, Spanish, and am attempting to learn Polish), I really enjoyed this discussion with Carl. If you are interested in languages, text analytics, search interfaces, entity resolution, and are curious to learn what any of this has to do with offline events such as the Boston Marathon Bombing, you’re going to enjoy my chat with Carl. We covered:

  • How text analytics software is used by border patrol agencies, and its limitations.
  • The role of humans in the loop, even with good text analytics in play
  • What actually happened in the case of the Boston Marathon Bombing?
  • Carl’s article “Exact Match” Isn’t Just Stupid. It’s Deadly.
  • The 2 lessons Carl has learned regarding working with native tongue source material.
  • Why Carl encourages Unicode Compliance when working with text, why having a global perspective is important, and how Carl actually implements this at his company
  • Carl’s parting words on why hybrid architectures are a core foundation to building better data products involving text analytics


Quotes from Today’s Episode

"One of the practices that I’ve always liked is actually getting people that aren’t like you, that don’t think like you, in order to intentionally tease out what you don’t know. You know that you’re not going to look at the problem the same way they do..." -- Brian O’Neill

"Bias is incredibly important in any system that tries to respond to human behavior. We have our own innate cultural biases that we’re sometimes not even aware of. As you [Brian] point out, it’s impossible to separate human language from the underlying culture and, in some cases, geography and the lifestyle of the people who speak that language…" -- Carl Hoffman

"What I can tell you is that context and nuance are equally important in both spoken and written human communication…Capturing all of the context means that you can do a much better job of the analytics." -- Carl Hoffman

"It’s sad when you have these gaps like what happened in this border crossing case where a name spelling is responsible for not flagging down [the right] people. I mean, we put people on the moon and we get something like a name spelling [entity resolution] wrong. It’s shocking in a way." -- Brian O’Neill

"We live in a world which is constantly shades of gray and the challenge is getting as close to yes or no as we can.”-- Carl Hoffman

Episode Transcript

Brian: Hey everyone, it’s Brian here and we have a special edition of Experiencing Data today. Today, we are going to be talking to Carl Hoffman, who’s the CEO of Basis Technology. Carl is not necessarily what I would call a traditional data product manager or someone working in the field of creating custom decision support tools. He is an expert in text analytics, and specifically Basis Technology focuses on entity resolution and resolving entities across different languages. If your product, your service, or the software tool that you’re using is going to be dealing with inputs, outputs, or search in multiple languages, I think you’re going to find my chat with Carl really informative. Without further ado, here’s my chat with Mr. Carl Hoffman.

All right. Welcome back to Experiencing Data. Today, I’m happy to have Carl Hoffman on the line, the CEO of Basis Technology, based out of Cambridge, Massachusetts. How’s it going, Carl?

Carl: Great. Good to talk to you, Brian.

Brian: Yeah, me too. I’m excited. This episode’s a little bit different. Basis Tech primarily focuses on providing text analytics more as a service as opposed to a data product. There are obviously some user experience ramifications on the downstream side of companies, software, and services that are leveraging some of your technology. Can you tell people a little bit about the technology of Basis and what you guys do?

Carl: There are many companies who are in the business of extracting actionable information from large amounts of dirty, unstructured data and we are one of them. But what makes us unique is our ability to extract what we believe is one of the most difficult forms of big data, which is text in many different languages from a wide range of sources.

You mentioned text analytics as a service, which is a big part of our business, but we actually provide text analytics in almost every conceivable form. As a service, as an on-prem cloud offering, as a conventional enterprise software, and also as the data fuel to power your in-house text analytics.

There’s another half of our business as well which is focused specifically on one of the most important sources of data, which is what we call digital forensics or cyber forensics. That’s the challenge of getting data off of digital media that may be either still in use or dead.

Brian: Talk to me about dead. Can you go unpack that a little bit?

Carl: Yes. Dead basically means powered off or disabled. The primary application there is for corporate investigators or for law enforcement who are investigating captured devices or digital media.

Brian: Got it. Just to help people understand some of the use cases that someone would be leveraging some of the capabilities of your platforms, especially the stuff around entity resolution, can you talk a little bit about like my understanding, for example, one use case for your software is obviously border crossings, where your information, your name is going to be looked up to make sure that you should be crossing whatever particular border that you’re at.

Can you talk to us a little bit about what’s happening there and what’s going on behind the scenes with your software? Like what is that agent doing and what’s happening behind the scenes? What kind of value are you providing to the government at that instance?

Carl: Border crossings, or the software used by border control authorities, is a very important application of our software. As a data representation challenge, it’s actually not that difficult because, for the most part, border authorities work with linear databases of known or partially known individuals, and queries. Queries may be manually typed by an officer or may be a scan of a passport.

The complexity comes in when a match must be scored, where a decision must be rendered as to whether a particular query or a particular passport scan matches any of the names present on a watch list. Those watch lists can be in many different formats. They can come from many different sources. Our software excels at performing that match at very high accuracy, regardless of the nature of the query and regardless of the source of the underlying watch list.

Brian: I assume those watch lists may vary in the level of detail around, for example, aliases, spelling, and which alphabet they were printed in. Part of the value of what your service is doing is helping to say, “At the end of the day, entity number seven on the list is one human being who may have many ways of being represented with words on a page or a screen,” so the goal obviously is to make sure that you have the full story of that one individual.

Am I correct that you may get that in various formats and different levels of detail? And part of what your system is doing is actually trying to match up that person or give what you call a non-binary response, a match score or something that’s more of a gray response that says, “This person may also be this person.” Can you unpack that a little bit for us?

Carl: Your remarks are exactly correct. First, what you said about gray is very important. These decisions are rarely 100% yes or no. We live in a world which is constantly shades of gray and the challenge is getting as close to yes or no as we can. But the quality of the data in watch lists can vary pretty wildly, based on the prominence and the number of sources. The US border authorities must compile information from many different sources, from the UN, from the Treasury Department, from the National Counterterrorism Center, from various states, and so on. The amount of detail and the degree of our certainty regarding that data can vary from name to name.

Brian: We talked about this when we first were chatting about this episode. Am I correct when I think that one of the overall values you’re providing is obviously that we’re offloading some of the labor of doing this kind of entity resolution or analysis onto software and then picking up the last mile with a human, to say, “Hey, are these recommendations correct? Maybe I’ll go in and do some manual labor.”

Is that how you see it, that we do some of the initial grunt work and you present an almost finished story, and then the human comes in and needs to really provide that final decision at the endpoint? Are we doing enough of the help with the software? At what point should we say, “That’s no longer a software job to give you a better score about this person. We think that really requires a human analysis at this point.” Is there a way to evaluate that, or is that what you think about, like, “Hey, we don’t want to go past that point. We want to stop here because the technology is not good enough or the data coming in will never be accurate enough.” I don’t know if that makes sense.

Carl: It does make sense. I can’t speak for all countries but I can say that in the US, the decision to deny an individual entry or certainly the decision to apprehend an individual is always made by a human. We designed our software to assume a human in the loop for the most critical decisions. Our software is designed to maximize the value of the information that is presented to the human so that nothing is overlooked.

Really, the two biggest threats to our national security are one, having very valuable information overlooked, which is exactly what happened in the case of the Boston Marathon bombing. We had a great deal of information about Tamerlan and Dzhokhar Tsarnaev, yet that information was overlooked because the search engines failed to surface it in response to queries by a number of officials. And secondly, detaining or apprehending innocent individuals, which hurts our security as much as allowing dangerous individuals to pass.

Brian: This has been in the news somewhat but talk about the “glitch” and what happened in that Boston Marathon bombing in terms of maybe some of these tools and what might have happened or not what might have happened, but what you understand was going on there such that there was a gap in this information.

Carl: I am always very suspicious when anyone uses the word ‘glitch’ with regard to any type of digital equipment because if that equipment is executing its algorithm as it has been programmed to do, then you will get identical results for identical inputs. In this case, the software that was in use at the time by US Customs and Border Protection was executing a very naive name-matching algorithm, which failed to match two different variant spellings of the name Tsarnaev. If you look at the two variations for any human, it would seem almost obvious that the two variations are related and are in fact connected to the same name that’s natively written in Cyrillic.

What really happened was a failure on the part of the architects of that name-matching system to innovate by employing the latest technology in name-matching, which is what my company provides. In the aftermath of that disaster, our software was integrated into the border control workflow, first with the goal of reducing false positives, and then later with the secondary goal of identifying false negatives. We’ve been very successful on both of those challenges.

Brian: What were the two variants? Are you talking about the fact that one was spelled in Cyrillic and one was spelled in a Latin alphabet? They didn’t bring back data point A and B because they look like separate individuals? What was it, a transliteration?

Carl: They were two different transliterations of the name Tsarnaev. In one instance, the final letters in the names are spelled -naev and the second instance it’s spelled -nayev. The presence or absence of that letter y was the only difference between the two. That’s a relatively simple case but there are many similar stories for more complex names.

For instance, the 2009 Christmas bomber successfully boarded a Northwest Delta flight with a bomb in his underwear, again because of a failure to match two different transliterations of his name. But in his case, his name is Umar Farouk Abdulmutallab. There was much more opportunity for divergent transliterations.
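
To make the -naev / -nayev example concrete, here is a minimal Python sketch contrasting exact match with a scored match. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; it is not the algorithm used by Basis Technology or by any border agency, and the function names and the `REVIEW_THRESHOLD` value are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: scores at or above this are surfaced to a human reviewer.
REVIEW_THRESHOLD = 0.85


def exact_match(query: str, listed: str) -> bool:
    """Naive comparison: true only if the spellings are identical, ignoring case."""
    return query.casefold() == listed.casefold()


def match_score(query: str, listed: str) -> float:
    """Similarity in [0.0, 1.0] based on runs of characters the two names share."""
    return SequenceMatcher(None, query.casefold(), listed.casefold()).ratio()


def needs_review(query: str, listed: str) -> bool:
    """Flag a candidate for human review instead of returning a hard yes or no."""
    return match_score(query, listed) >= REVIEW_THRESHOLD


if __name__ == "__main__":
    pairs = [("Tsarnaev", "Tsarnayev"), ("Tanaka", "Tsarnayev")]
    for query, listed in pairs:
        print(
            query,
            listed,
            exact_match(query, listed),
            round(match_score(query, listed), 3),
            needs_review(query, listed),
        )
```

In this sketch the two transliterations fail the exact comparison but score well above the assumed threshold, so the pair would be surfaced for a person to decide, while the unrelated name stays below it.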

Brian: On this kind of topic, you wrote an interesting article called “Exact Match” Isn’t Just Stupid. It’s Deadly. You’ve talked a little bit about this particular example with the Boston Marathon bombing. You mentioned that they’re thinking globally about building a product out. Can you talk to us a little about what it means to think globally?

Carl: Sure. Thinking globally is really a mindset and an architectural philosophy in which systems are built to accommodate multiple languages and cultures. This is an issue not just with the spelling of names but with support for multiple writing systems, different ways of rendering and formatting personal names, different ways of rendering, formatting, and parsing postal addresses, telephone numbers, dates, times, and so on.

The format of a questionnaire in Japanese is quite different from the format of a questionnaire in English. If you look at any complex global software product, there’s a great deal of work that must be done to accommodate the needs of a worldwide user base.

Brian: Sure and you’re a big fan of Unicode-compliant software, am I correct?

Carl: Yes. Building Unicode compliance is equivalent to building a solid stable foundation for an office tower. It only gets you to the ground floor, but without it, the rest of the tower starts to lean like the one that’s happening in San Francisco right now.

Brian: I haven’t heard about that.

Carl: There’s a whole tower that’s tipping over. You should read it. It’s a great story.

Brian: Foundation’s not so solid.

Carl: Big lawsuit’s going on right now.

Brian: Not the place you want to have a sagging tower either.

Carl: Not the place, but frankly, it’s really quite comparable because I’ve seen some large systems, which will go unnamed, where there’s legacy technology and people are perhaps unaware of why it’s so important to move from Python version 2 to Python version 3. One of the key differences is Unicode compliance. So if I hear about a large-scale enterprise system that’s based on Python version 2, I’m immediately skeptical that it’s going to be suitable for a global audience.
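
As a small illustration of why Unicode handling matters, the following Python 3 sketch shows that text is counted in code points while its storage size depends on the encoding, and that a one-byte-per-character encoding cannot represent non-Latin names at all. The helper names `fits_in_eight_bits` and `describe`, and the sample names, are assumptions chosen for illustration only.

```python
from typing import Tuple


def fits_in_eight_bits(text: str) -> bool:
    """Return True if every character fits a one-byte-per-character encoding (Latin-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def describe(text: str) -> Tuple[int, int, bool]:
    """Return (code points, UTF-8 byte length, representable in 8 bits per character)."""
    return len(text), len(text.encode("utf-8")), fits_in_eight_bits(text)


if __name__ == "__main__":
    # The same surname in Latin and Cyrillic script, plus a Japanese surname.
    for name in ["Tsarnaev", "Царнаев", "田中"]:
        print(name, describe(name))
```

In Python 3, `str` holds Unicode code points, so `len` counts characters rather than bytes; the byte length only appears once the text is explicitly encoded.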

Brian: I think about, from an experience standpoint, inputs, when you’re providing inputs into forms and understanding what people are typing in. If it’s a query form, obviously giving people back what they wanted and not necessarily what they typed in. We all take for granted things like this spelling correction, and not just the spelling correction, but in Google when you type in something, it sometimes gives you something that’s beyond a spelling thing, “Did you mean X, Y, and Z?”

I would think that being informed about what people are typing into your form fields, and mining your query logs, is important. This is something I do sometimes with clients when they’re trying to learn something. I actually just read an article today about dell.com and the top query term on dell.com is ‘Google,’ which is a very interesting thing. I would be curious to know why people are typing that in. Is it really like people are actually trying to access Google or are they trying to get some information?

But the point is to understand the input side and to try to return some kind of logical output. Whether it’s text analytics that’s providing that or it’s name-matching, it’s being aware of that and it’s sad when you have these gaps like what happened in this border crossing case where a name spelling is responsible for not flagging down these people. I mean, we put people on the moon and we get something like a name spelling wrong. It’s shocking in a way.

I guess for those who are working in tech, we can understand how it might happen, but it’s scary that that’s still going on today. You’ve probably seen many others. Are you able to talk about it? Obviously, you have some in the intelligence field and probably government where you can’t talk about some of your clients, but are there other examples of learning that’s happened, even if it’s not necessarily entity resolution, where you’ve put dots together with some of your platform?

Carl: I’ll say the biggest lesson that I’ve learned from nearly two decades of working on government applications involving multilingual data is the importance of retaining as much of the information in its native form as possible. For example, there is a very large division of the CIA which is focused on collecting open source intelligence in the form of newspapers, magazines, the digital equivalent of those, radio broadcasts, TV broadcasts, and so on. It’s a unit which used to be known as the Foreign Broadcast Information Service, going back to World War II, and today it’s called the Open Source Enterprise. They have a very large collection apparatus and they produce some extremely high-quality products, which are summaries and translations from sources in other languages.

In their workflow, previously they would collect information, say in Chinese or in Russian, and then do a translation or summary into English, but then would discard the original or the original would be hidden from their enterprise architecture for query purposes. I believe that is no longer the case, but retaining the pre-translation original, whether it’s open source, closed source, commercial, enterprise information, government-related information, is really very important. That’s one lesson. The other lesson is appreciating the limits of machine translation. We’re increasingly seeing machine translation integrated into all kinds of information systems, but there needs to be a very sober appreciation of what is and what is not achievable and scalable by employing machine translation in your architecture.

Brian: Can you talk at all about the translation? We have so much power now with NLP and what’s possible with the technology today. As I understand it, when we talk about translation, we’re talking about documents and things that are in written word that are being translated from one language to another.

But in terms of spoken word, and we’re communicating right now, I’m going to ask you two questions. What do you know about NLP and what do you know about NLP? The first one I had a little bit of attitude which assumes that you don’t know too much about it, and the second one, I was treating you as an expert. When this gets translated into text, it loses that context. Where are we with that ability to look at the context, the tone, the sentiment that’s behind that?

I would imagine that’s partly why you’re talking about saving the original source. It might provide some context like, “What were the headlines in the paper?” and, “Which paper wrote it?” and, “Is there a bias with that paper?” Whatever it is, having some context of the full article that that report came from can provide additional context.

Humans are probably better at doing some of that initial eyeball analysis or having some idea of historically where this article’s coming from such that they can put it in some context as opposed to just seeing the words in a native language on a computer screen. Can you talk a little bit about that or where we are with that? And am I incorrect that we’re not able to look at that sentiment? I don’t even know how that would translate necessarily unless you had a playing back of a recording of someone saying the words. You have translation on top of the sentiment. Now you’ve got two factors of difficulty right there and getting it accurate.

Carl: My knowledge of voice and speech analysis is very naive. I do know it’s an area of huge investment and the technology is progressing very rapidly. I suspect that voice models are already being built that can distinguish between the two different intonations you used in asking that question and are able to match those against knowledge bases separately.

What I can tell you is that context and nuance are equally important in both spoken and written human communication. My knowledge is stronger when it comes to the written form. Capturing all of the context means that you can do a much better job of the analytics. That’s why, say, when we’re analyzing a document, we’re looking not only at the individual word but at the sentence, the paragraph, and where the text appears. Is it in the body? Is it in a heading? Is it in a caption? Is it in a footnote? Or if we’re looking at, say, human-typed input (I think this is where your audience would care if you’re designing forms or search boxes), there’s a lot that can be determined in terms of how the input is typed. Again, especially when you’re thinking globally.

We’re familiar with typing English and typing queries or completing forms with the letters A through Z and the numbers 0 through 9, but the fastest-growing new orthography today is emoticons, and emoji offer a lot of very valuable information about the mindset of the author.

Say that we look at Chinese or Japanese, which are basically written with thousand-year-old emoji, where an individual must type a sequence of keys in order to create each of the Kanji or Hanzi that appears. There’s a great deal of information we can capture.

For instance, if I’m typing a form in Japanese, say I’m filling out my last name, and my last name is Tanaka. Well, I’m going to type phonetically some characters that represent Tanaka, either in Latin letters or one of the Japanese phonetic writing systems, then I’m going to pick from a menu or the system is going to automatically pick for me the Japanese characters that represent Tanaka. But any really capable input system is going to keep both whatever I typed phonetically and the Kanji that I selected because both of those have value and the association between the two is not always obvious.

There are similar ways of capturing context and meaning in other writing systems. For instance, let’s say I’m typing Arabic not in Arabic script but I’m typing with Roman letters. How I translate from those Roman letters into the Arabic alphabet may vary, depending upon if I’m using Gulf Arabic, or Levantine Arabic, or Cairene Arabic, and say the IP address of the person doing the typing may factor into how I do that transformation and how I interpret those letters. There’s examples for many other writing systems other than the Latin alphabet.
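
The Tanaka example can be sketched as a record that keeps both what was typed and what was selected. This is a hypothetical data structure, not how any particular input method or Basis Technology product stores names; the class, field, and function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NameEntry:
    typed_reading: str  # phonetic keystrokes as entered (here, hiragana)
    selected_form: str  # characters the user picked from the conversion menu


def find_by_reading(entries: List[NameEntry], reading: str) -> List[NameEntry]:
    """All entries whose phonetic reading matches, whatever characters were chosen."""
    return [e for e in entries if e.typed_reading == reading]


def find_by_form(entries: List[NameEntry], form: str) -> List[NameEntry]:
    """All entries whose selected written form matches exactly."""
    return [e for e in entries if e.selected_form == form]


if __name__ == "__main__":
    entries = [
        NameEntry("たなか", "田中"),  # Tanaka
        NameEntry("たなか", "田仲"),  # also read Tanaka, written differently
        NameEntry("なかた", "中田"),  # Nakata
    ]
    print(find_by_reading(entries, "たなか"))
    print(find_by_form(entries, "田中"))
```

Both 田中 and 田仲 are read Tanaka, so a search on the reading returns both while a search on the written form separates them. Discarding either field would lose that association.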

Brian: I meant to ask you. Do you speak any other languages or do you study any other languages?

Carl: I studied Japanese for a few years in high school. That’s really what got me into using computers to facilitate language understanding. I just never had the ability to really quickly memorize all of the Japanese characters, the radical components, and the variant pronunciations. After spending countless hours combing through paper dictionaries, I got very interested in building electronic dictionaries. My interest in electronic dictionaries eventually led to search engines and to lexicons, algorithms powered by lexicons, and then ultimately to machine learning and deep learning.

Brian: I’m curious. I assume you need to employ either a linguist or at least people that speak multiple languages. One concern with advanced analytics right now and especially anything with prediction, is bias. I speak a couple of different languages and I think one of the coolest things about learning another language is seeing the world through another context. Right now, I’m learning Polish and there’s the concept of case and it doesn’t just come down to learning the prefixes and suffixes that are added to words. Effectively, that’s what the output is but it’s even understanding the nuance of when you would use that and what you’re trying to convey, and then when you relay it back to your own language, we don’t even have an equivalent between this. We would never divide this verb into two different sentiments. So you start to learn what you don’t even know to think about.

I guess what I’m asking here is how do you capture those things? Say, in our case where I assume you’re an American and I am too, so we have our English that we grew up with and our context for that. How do you avoid bias? Do you think about bias? How do you build these systems in terms of approaching it from a single language? Ultimately, this code is probably written in English, I assume. Not to say that the code would be written in a different language but just the approach when you’re thinking about all these systems that have to do with language, where does that come in, in integrating other people who speak other languages? Can you talk about that a little bit?

Carl: Bias is incredibly important in any system that tries to respond to human behavior. We have our own innate cultural biases that we’re sometimes not even aware of. As you point out, it’s impossible to separate human language from the underlying culture and, in some cases, geography and the lifestyle of the people who speak that language. Yes, this is something that we think about.

I disagree with your remark about code being written in English. The most important pieces of code today are the frameworks for implementing various machine learning and deep learning architectures. These architectures for the most part are language or domain-agnostic. The language bias tends to creep in as an artifact of the data that we collect.

If I were to, say, harvest a million pages randomly on the internet, a very large percentage of those pages would be in English, out of proportion to the proportion of the population of the planet who speaks English, just because English is a common language for commerce, science, and so on. The bias comes in from the data or it comes in from the mindset of the architect, who may do something as simple-minded as allocating only eight bits per character or deciding that Python version 2 is an acceptable development platform.

Brian: Sure. I should say, I wasn’t so much speaking about the script, the code, as much as I was thinking more about the humans behind it, their background, and their language that they speak, or these kinds of choices that you’re talking about because they’re informed by that person’s perspective. But thank you for clarifying.

Carl: I agree with that observation as well. You’re certainly right.

Brian: Do you have a way? You’re experts in this area and you’re obviously heavily invested in this area. Are there things that you have to do to prevent that bias, in terms of like, “We know what we don’t know about it, or we know enough about it but we don’t know if about, so we have a checklist or we have something that we go through to make sure that we’re checking ourselves to avoid these things”? Or is it more in the data collection phase that you’re worried about more so than the code or whatever that’s actually going to be taking the data and generating the software value at the other end? Is it more on the collection side that you’re thinking about? How do you prevent it? How do you check yourself or tell a client or customer, “Here’s how we’ve tried to make sure that the quality of what we’re giving you is good. We did A, B, C, and D.” Maybe I’m making a bigger issue out of this than it is. I’m not sure.

Carl: No, it is a big issue. The best way to minimize that cultural bias is by building global teams. That’s something that we’ve done from the very beginning days of our company. We have a company in which collectively the team speaks over 20 languages and originates from many different countries around the world, and we do business in many countries around the world. That’s just been an absolute necessity because we produce products that are proficient in 40 different human languages.

If you’re a large enterprise, more than 500 people, and you’re targeting markets globally, then you need to build a global team. That applies to all the different parts of the organization, including the executive team. It’s rare that you will see individuals who are, say, American culture with no meaningful international experience being successful in any kind of global expansion.

Brian: That’s pretty awesome that you have that many languages going in the staff that you have working at the company. That’s cool and I think it does provide a different perspective on it. We talk about it even in the design firm. Sometimes, early managers in design will want to go hire a lot of people that look like they do. Not necessarily physically but in terms of skill set. One of the practices that I’ve always liked is actually getting people that aren’t like you, that don’t think like you, in order to intentionally tease out what you don’t know. You know that you’re not going to look at the problem the same way they do, and you don’t necessarily know what the output is, but you can learn that there are other perspectives to have. Having too many like-minded individuals doesn’t necessarily mean that it’s better. I think that’s cool.

Can you talk to me a little bit about one of the fun little nuggets that stuck in my head and I think you’ve attributed to somebody else, but was the word about getting insights from medium data. Can you talk to us about that?

Carl: Sure. I should first start by crediting the individual who planted that idea in my head, which is Dr. Catherine Havasi of the MIT Media Lab, who’s also a cofounder of a company called Luminoso, which is a partner of ours. They do common sense understanding.

The challenge with building truly capable text analytics from large amounts of unstructured text is obtaining sufficient volume. If you are a company on the scale of Facebook or Google, you have access to a truly enormous amount of text. I can’t quantify it in petabytes or exabytes, but it is a scale that is much greater than the typical global enterprise or Fortune 2000 company, who themselves may have very massive data lakes. But still, those data lakes are probably three to five orders of magnitude smaller than what Google or Facebook may have under their control.

That intermediate-sized data, which is sloppily referred to as big data, we think of it as medium data. We think about the challenge of allowing companies with medium data assets to obtain big data quality results, or business intelligence that’s comparable to something that Google or Facebook might be able to obtain. We do that by building models that are hybrid, that combine knowledge graphs or semantic graphs, derived from very large open sources with the information that they can extract from their proprietary data lakes, and using the open sources and the models that we build as amplifiers for their own data.

Brian: I believe when we were talking, you mentioned a couple of companies that are building products on top of you. Diffeo, I think, was one, and Tamr, and Luminoso. So is that related to what these companies are doing?

Carl: Yes, it absolutely is related. Luminoso, in particular, is using this process of synthesizing results from their customers’ proprietary data with their own models. The Luminoso team grew out of the team at MIT that built something called ConceptNet, which is a very large knowledge graph in multiple languages. But actually, Diffeo is also using this approach of federating both open and closed source repositories by integrating a large number of connectors into their architecture. They have access to web content. They have access to various social media fire hoses. They have access to proprietary data feeds from financial news providers. But then, they fuse that with internal sources of information that may come from sources like SharePoint, or Dropbox, or Google Drive, or OneDrive, your local file servers, and then give you a single view into all of this data.

Brian: Awesome. I don’t want to keep you too long. This has been super informational for me, learning about your space that you’re in. Can you tell us any closing thoughts, advice for product managers, analytics practitioners? We talked a little about obviously thinking globally and some of those areas. Any other closing thoughts about delivering good experiences, leveraging text analytics, other things to watch out for? Any general thoughts?

Carl: Sure. I’ll close with a few thoughts. One is repeating what I’ve said before about Unicode compliance. The fact that I again have to state that is somewhat depressing. It still isn’t taken as the absolute requirement that it is today, and it continues to be overlooked. Secondly, just thinking globally, anything that you’re building, you’ve got to think about a global audience.

I’ll share with you an anecdote. My company gives a lot of business to Eventbrite, who I would expect by now would have a fully globalized platform, but it turns out their utility for sending an email to everybody who signed up for an event doesn’t work in Japanese. I found that out the hard way when I needed to send an email to everybody that was signed up for our conference in Tokyo. That was very disturbing and I’m not afraid to say that live on a podcast. They need to fix it. You really don’t want customers finding out about that during a time of high stress and high pressure, and there’s just no excuse for that.

Then my third point is with regard to natural language understanding. This is a really incredibly exciting time to be involved with natural language, with human language, because the technology is changing so rapidly and the space of what is achievable is expanding so rapidly. My final point of advice is that hybrid architectures have been the best and continue to be the best. There’s a real temptation to say, “Just throw all of my text into a deep neural net and magic is going to happen.” That can be true if you have sufficiently large amounts of data, but most people don’t. Therefore, you’re going to get better results by using hybrids of algorithmic, simpler machine learning architectures together with deep neural nets.

Brian: That last tip, can you take that down one more notch? I assume you’re talking about a level of quality on the tail-end of the technology implementation, there’s going to be some higher quality output. Can you translate what a hybrid architecture means in terms of a better product at the other end? What would be an example of that?

Carl: Sure. It’s hard to do without getting too technical, but I’ll try and I’ll try to use some examples in English. I think the traditional way of approaching deep nets has very much been to take a very simple, potentially deep and recursive neural network architecture and just throw data at it, especially images or audio waveforms. I throw my images in and I want to classify which ones were taken outdoors and which ones were taken indoors, with no traditional signal processing or image processing added before or after.

In the image domain, my understanding is that that kind of purist approach has delivered the best results, and that’s what I’ve heard. I don’t have first-hand information about that. However, when it comes to human language in its written form, there’s a great deal of traditional processing of that text that boosts the effectiveness of the deep learning. That falls into a number of layers that I won’t go into, but to just give you one example, let’s talk about what we call orthography.

The English language is relatively simple in that the orthography is generally quite simple. We’ve got the letters A through Z, in uppercase and lowercase, and that’s about it. But if you look inside, say, a PDF of English text, you’ll sometimes encounter things like ligatures: a lowercase F followed by a lowercase I, or two lowercase Fs together, will be replaced with a single glyph to make it look good in that particular typeface. If I take those glyphs and I just throw them in with all the rest of my text, that actually complicates the job of the deep learning. If I take that FI ligature and convert it back to a separate F followed by I, or the FF ligature and convert it back to FF, my deep learning doesn’t have to figure out what those ligatures are about.
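[Editor’s illustration: the ligature cleanup Carl describes can be sketched in a few lines of Python. This is a minimal sketch that assumes the ligatures arrive as Unicode compatibility characters (such as U+FB01 for "fi") and uses the standard library’s unicodedata module. The function name expand_ligatures is hypothetical, and nothing here is meant to represent how Basis Technology actually implements this step.]

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as the fi/ff ligatures.

    NFKC normalization replaces each compatibility character with its
    plain equivalent, so the single glyph U+FB01 becomes the two
    letters "f" and "i".
    """
    return unicodedata.normalize("NFKC", text)


# Text as it might come out of a PDF, with ffi, fi, and ff ligatures.
extracted = "The o\ufb03ce \ufb01le was di\ufb00erent."
cleaned = expand_ligatures(extracted)
print(cleaned)
```

[NFKC also folds other compatibility forms, such as full-width letters and superscript digits, so a real pipeline might prefer a narrower mapping if those distinctions matter downstream.]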

Now that seems pretty obscure in English, but in other writing systems, especially Arabic, for instance, in which there’s an enormous number of ligatures, or Korean, or languages that have diacritical marks, processing those diacritical marks, those ligatures, those orthographic variations using conventional means will make your deep learning run much faster and give you better results with less data. That’s just one example, but there’s a whole range of other text-processing steps using algorithms that have been developed over many years that simply make the deep learning work better, and that results in what we call a hybrid architecture.
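[Editor’s illustration: the same kind of conventional preprocessing extends to other scripts. The sketch below assumes Python’s standard unicodedata module and uses a hypothetical helper named normalize_text. It shows one possible treatment, not Basis Technology’s method, and the optional diacritic folding is lossy, so whether to apply it depends on the language and the task.]

```python
import unicodedata


def normalize_text(text: str, fold_diacritics: bool = False) -> str:
    """Bring orthographic variants to one consistent form."""
    # NFKC expands compatibility forms (for example Arabic presentation-form
    # ligatures) and composes sequences such as Korean jamo into syllables.
    text = unicodedata.normalize("NFKC", text)
    if fold_diacritics:
        # Optional and lossy: decompose, drop combining marks, recompose.
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", kept)
    return text


# Two encodings of the same French word end up identical.
same_word = normalize_text("cafe\u0301") == normalize_text("caf\u00e9")

# The Arabic lam-alef ligature (U+FEFB) becomes the letters lam + alef.
arabic = normalize_text("\ufefb")

# Three Korean jamo compose into a single Hangul syllable (U+D55C).
korean = normalize_text("\u1112\u1161\u11ab")

print(same_word, [hex(ord(c)) for c in arabic], hex(ord(korean)))
```

[The point of the sketch is that variants which look the same to a reader also look the same to the model, so the network does not have to spend data learning that equivalence.]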

Brian: So it sounds like, as opposed to throwing it all in a pot and stirring, there’s the, “Well, maybe I’m going to cut the carrots neatly into the right size and then throw them in the soup.”

Carl: Exactly.

Brian: You’re kind of helping the system do a better job at its work.

Carl: That’s right and it’s really about thinking about your data and understanding something about it before you throw it into the big brain.

Brian: Exactly. Cool. Where can people follow you? I’ll put a link up to the Basis in the show notes but are you on Twitter or LinkedIn somewhere? Where can people find you?

Carl: LinkedIn tends to be my preferred social network. I just was never really good at summarizing complex thoughts into 140 characters, so that’s the best place to connect with me. Basistech.com will tell you all about Basis Technology, and rosette.com is our text analytics platform, which is free for anybody to explore. To the best of my knowledge, it is the most capable text analytics platform with the largest number of languages that you will find anywhere on the public internet.

Brian: All right, I will definitely put those up in the show notes. This has been fantastic, I’ve learned a ton, and thanks for coming on Experiencing Data.

Carl: Great talking with you, Brian.

Brian: All right. Cheers.

Carl: Cheers.
