Jim Psota is the Co-Founder and CTO of Panjiva, which was named one of the top 10 Most Innovative Data Science Companies in the World by Fast Company in 2018. Panjiva has mapped the global supply chain using a combination of over 1B shipping transactions and machine learning, and recently the company was acquired by S&P Global.
Jim has spoken about artificial intelligence and entrepreneurship at Harvard Business School, MIT, and The White House, and at numerous academic and industry conferences. He also serves on the World Economic Forum’s Working Group for Artificial Intelligence and has done Ph.D. research in computer science at MIT. Some of the topics we discuss in today’s episode include:
- What Jim learned from starting Panjiva from a data-first approach
- Brian and Jim’s thoughts on problem solving driven by use cases and people vs. data and AI
- 3 things Jim wants teams to get right with data products
- Jim and Brian’s thoughts on “blackbox” analytics that try to mask complex underlying data to make the UX easier
- How Jim balances the messiness of 20+ third-party data sources, designing a good product, and billions of data points
Quotes from Jim Psota
“When you’re dealing with essentially resolving 1.5 billion records, you could think of that you need to compute 1.5 billion squared pairs of potential similarities.”
“It’s much more fulfilling to be building for a person or a set of people that you’ve actually talked to… The engineers are going to develop a much better product and frankly be much happier having that connection to the user.”
“We have crossed a pretty cool threshold where a lot of value can be created now because we have this nice combination of data availability, strong algorithms, and compute power.”
“In our case and many other company’s cases, taking third-party data, no matter where you’re getting your data, there’s going to be issues with it, there’s going to be delays, format changes, granularity differences.”
“As much as possible, we try to use the tools of data science to actually correct the data deficiency or impute or whatever technique is actually going to be better than nothing, but then say this was imputed or this is reported versus imputed…then over time, the user starts to understand if it’s gray italics [the data] was imputed, and if it’s black regular text, that’s reported data, for example.”
Brian: Hey everyone, it’s Brian here with Experiencing Data. On this episode, I’m going to interview Jim Psota of Panjiva. Panjiva is an enterprise data product SaaS company that provides analytics and insights on the global supply chain. Jim’s going to talk to us about three things to get right when you’re creating a data product. He’s also going to go into some of the lessons that he’s learned from the early days starting from a data perspective versus starting from a customer perspective. They’ve recently been acquired as well so I’m going to let Jim talk about his journey with Panjiva. Here’s my interview with Jim Psota.
Welcome back to Experiencing Data. This is Brian O’Neill. I’m really excited to have my friend Jim Psota on the line today. Jim is the cofounder and CTO of Panjiva which is a software company that has mapped the global supply chain using a combination of over 1 billion shipping transactions and machine learning.
Jim and I have known each other for a couple of years and they’re actually celebrating two recent achievements. One of them being acquired by S&P Global—which is a very old data company—and they were also just named one of the 10 most innovative data science companies in the world by Fast Company which is awesome. Can you tell us what’s going to happen now that Panjiva’s a part of S&P? Is the product going to exist on its own or are you going to fold it into their software? Tell us about what that’s going to mean for your customers and the experience.
Jim: We’re very excited to be part of S&P Global. S&P, just by way of background, is an old data company and one of the first, if not the first. It was founded in 1860 as Standard & Poor’s, rating railroad companies, and has evolved today into a company that has some of the best data on companies out there. It’s got a lot of very rich and esoteric but useful data. We’re very excited to complement the offering and also have the brand behind us since everyone knows S&P through the S&P 500, etc.
Panjiva is still running on its own right now and over time, we’ll be sort of grafted into their flagship data product, which is called S&P Global Market Intelligence, as a new supply chain offering. The Panjiva Supply Chain Graph has been concorded or linked up to the S&P Company Graph. It’s actually something that’s already done and we’re about to roll out a new product in the Market Intelligence product line which essentially gives the first taste of Panjiva, and over time the more advanced analytics and data visualization will also be rolled in. In the medium term, panjiva.com will go away. But I actually think that’s probably a good thing because we’re now going to go from having 20,000 customers to over 10x that many users.
Also, a lot of the data that will be used to enrich the offering within S&P but also the techniques and data science pipelines that we’ve built are actually pretty generic at this point because we’ve created over 30 datasets within Panjiva–lots of different types of datasets. We’re talking a lot of that shipping data, but we’ve also pulled a lot of company-level data as well. We’re excited to also leverage a lot of those data science pipelines and other software tools that we built for shipping data and use that sort of more generically across the S&P datasets.
Brian: That sounds like it’s going to be a big project to merge all your analytics.
Jim: For sure.
Brian: That sounds like a lot of work to get right and to do carefully and to make sure the value still is evident. Does S&P see that as you filling gaps that they had, or is it more like you’re adding on features, or insights, or an additional layer? For example, they have every company that you have but they don’t have data X, Y, and Z or analytic insight X, Y, and Z. Is it more of an enhancement or is it more of a gap fill?
Jim: It’s actually both. The bread and butter of S&P has been more public companies and banks inside more developed countries. There’s a lot of Panjiva data in those areas as well but Panjiva really shines in places where a lot of data providers have stumbled, which is in the long tail of smaller companies, a lot of private companies. Alternative private data is very popular right now and very unique. Panjiva’s really going to be adding a lot there.
Panjiva currently has profiles from the 1.5 billion shipping records that map about 10 million unique company entities, and that’s going to significantly increase the company-level coverage that S&P has. They’re excited about that but, as you mentioned, the data science and analytics capabilities sit on top of that, and we’re going to be working with them to fold a lot of that in as well to do sort of on-the-fly reporting and allow the product to be really customized for a particular user and use case as opposed to pre-packaging reports–that’s going to be another piece.
Finally, the team, I’m very proud of the team. They’re pretty amazing and trust me, every day, they get me excited to work closely with the Panjiva team to develop the next generation data and AI products.
Brian: You’ve also done PhD work at MIT, so you’ve got a lot of background in computer science, data products, and I’m really excited to talk to you about what you guys started at Panjiva, where you guys went. Obviously, you were doing something valuable because a: customers and b: you’ve just been acquired. Welcome to the show! Tell us a little bit more about Panjiva and what you’ve been up to over there.
Jim: Yes, Brian. Great to be here and see data products, products in general, analytics products similarly, so great to talk through this. I can say that we’ve had a lot of achievements recently, but it’s definitely taken a long time to get here, with a lot of meandering back and forth and a lot of mistakes. We can talk about all that as well.
We’re an AI company focused on supply chain, helping companies engaged in global trade make better decisions when they’re confronted with different aspects of the global supply chain. We realized that the global supply chain is incredibly complex, incredibly opaque, and we set out to use data and technology to help people make better decisions. Our bread and butter is transaction-level shipping data. We’ll talk about that more but suffice it to say, for now, it’s really large data and it’s really messy, unstructured data. That’s where the technology comes in to essentially map that data and turn it into structured data that is actually amenable to analysis. There is a lot of noise that we split out and then we essentially package that data in a SaaS product—very visual and intuitive—to essentially allow non-technical users to ask questions and gain insights from Panjiva. That’s kind of the broad strokes.
I can give you a few specific examples to make it concrete. We have over 3000 customers and those customers span a variety of industries. Physical goods manufacturers are our bread and butter, folks that are actually importing goods. We have customers like Walmart and Home Depot, but also companies that are analyzing global trade, sometimes at a macro level, looking at an industry or looking at a company, perhaps for investment purposes. We have hedge funds and asset managers using Panjiva. Also, shipping companies use Panjiva to optimize their shipping routes.
To just give you a concrete example, imagine if you run a shipping company and you have one of your big container ships going from the West edge of the Panama Canal up to San Francisco. That boat specializes in refrigerated containers, carrying goods like vegetables. Let’s say that boat is only 60% full and you want to find customers that are shipping on that route that are shipping a particular product that you care about and in particular volumes and frequencies. We have 10 million companies in Panjiva and with a few clicks in the interface, you can very quickly find a shortlist of companies you can go talk to potentially to partner with them and get them into your shipping route. That’s a concrete example.
Another concrete example, importers such as Home Depot who’s been one of our long standing customers, they use Panjiva to find new manufacturers. If they’re coming out with a new product line or ramping up volume on a particular product line, they can look for manufacturers all over the world. It’s interesting because in the beginning of Panjiva we thought that our sort of go-to target market would be small, medium-sized customers, but we learned very quickly that even large companies—even ones with offshore offices in places like China and Vietnam—they also subscribe to Panjiva because things are constantly shifting with shifting tariff rates and product lines and companies going in and out of business. These folks are really hungry for data.
Importers use Panjiva to find new manufacturers and also to keep an eye on their competition. And then finally, exporters: we have many customers in places like China and Vietnam. They will use Panjiva essentially to find companies that buy the goods that they make. You can go into Panjiva and look at groups of buyers, so to speak, categorize them as importers, and essentially find a shortlist of companies to reach out to and market to.
Brian: It’s almost like, not necessarily lead generation, but you’re almost connecting a buyer and seller–like you gave the Home Depot example. If I have that right, that might be an example such as, I’ve a vegetable garden actually in my backyard so I use this drip irrigation system and I got it at Home Depot. Let’s say Home Depot is like, “We want to have our own line of drip irrigation.” They might go and look up, “Who’s shipping from this area that specializes in plastics or something?” Or they might look at competitor and type in their name, is it something like, “Oh, let’s go look at whatever rainwater.com,” whatever the competitor is and see who’s their supplier and that type of thing? Is that how they might use the product?
Jim: Yes, exactly. Lead gen is one of our few key use cases and then, in the inverse, sort of vendor generation and competitive intelligence. You hit the nail on the head: that is a very popular case, to look at a competitor or a future competitor and say, “Where are they getting their goods made?” The reason that that’s so straightforward in Panjiva but difficult with the raw material, which is the shipping data, is all the structuring that we do. This gets a little bit into the secret sauce.
We’ve essentially mapped a graph for the network of the global supply chain. The Panjiva Supply Chain Graph, as we call it, is the largest and most complete supply chain graph in existence because we have essentially pieced together shipping data from about 20 different transaction-level data sources. These data sources tend to be government agencies. In the USA, it’s the Department of Homeland Security. Because we’re kind of piecing these together, you can see the primary nodes in the graph are essentially companies. The links between those nodes are supply chain transaction-level relationships. You can see customer or vendor relationships. You can very quickly navigate that graph and kind of see how you’re connected to a particular company, much like you can do on LinkedIn or Facebook with people.
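The structure Jim describes, companies as nodes and shipment relationships as links, can be illustrated with a very small adjacency-map sketch. This is only a toy illustration and not Panjiva's implementation; the class name, method names, and company names are all hypothetical.

```python
from collections import defaultdict


class SupplyChainGraph:
    """Toy directed graph: an edge supplier -> buyer means at least one shipment."""

    def __init__(self):
        self._buyers = defaultdict(set)     # supplier -> set of buyers
        self._suppliers = defaultdict(set)  # buyer -> set of suppliers

    def add_shipment(self, supplier, buyer):
        """Record that `supplier` shipped goods to `buyer` (hypothetical API)."""
        self._buyers[supplier].add(buyer)
        self._suppliers[buyer].add(supplier)

    def suppliers_of(self, company):
        """Vendors that have shipped to `company`, sorted for stable output."""
        return sorted(self._suppliers.get(company, ()))

    def customers_of(self, company):
        """Buyers that `company` has shipped to, sorted for stable output."""
        return sorted(self._buyers.get(company, ()))


# Made-up companies, purely for illustration.
graph = SupplyChainGraph()
graph.add_shipment("Factory A", "Retailer X")
graph.add_shipment("Factory B", "Retailer X")
graph.add_shipment("Factory A", "Retailer Y")

retailer_x_suppliers = graph.suppliers_of("Retailer X")
factory_a_customers = graph.customers_of("Factory A")
print(retailer_x_suppliers)
print(factory_a_customers)
```

Keeping both directions of each edge makes the two lookups discussed in the interview (who supplies a competitor, and who buys from a manufacturer) equally cheap.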
Brian: Do some of the data science aspects, in terms of what you’re doing, have to do with resolving entities that have names that are listed as Dell Computer, Dell EMC, Dell, Dell Limited, and it’s like, “Well, this is actually all one company”? Even though the manifests may have different company names on them, you understand that that’s one logical entity, so that you can look at all of Dell’s suppliers, for example. Is that one of the things you guys do in terms of cleaning the data and actually being able to provide this graph that’s accurate?
Jim: Exactly. The entity resolution technology and clustering of the data is something that we spent over a person-decade of work on…
Jim: …and is definitely a great […] fourth generation. There’s a couple of reasons why doing it at this scale that we’ve had to do it at is more difficult than de-duplicating your Salesforce account which is another example of this but at a much lower scale. When you’re dealing with essentially resolving 1.5 billion records, you could think of that you need to compute 1.5 billion squared pairs of potential similarities. If you just run the naïve calculation, that could end up taking many, many years.
Basically, the techniques we’ve developed handle both the scale and the sort of dirtiness of the data. For a given company (you gave the Dell example), we may have tens of thousands of name variations, and in different languages. Right now, the Panjiva data comes in six or seven different languages, with different character encodings, Chinese, etc. Handling all that gracefully is difficult. But as you said, it’s one of the key value-adds from the data science side. We haven’t talked about the product yet, but on the data science side that’s one of the handful of key enhancements that we make to the raw data to get it into essentially a new dataset, which is this combined dataset and […] graph that is, I think, the foundation of the whole company and product.
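To make the scale argument concrete: comparing every record against every other record grows quadratically, which is why all-pairs matching is impractical at 1.5 billion records. One common way to cut the work, shown here only as a generic illustration and not as Panjiva's technique, is "blocking": group records by a cheap key and compare only within each group. The helper names and the first-token blocking key below are assumptions made for this sketch.

```python
import re
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n):
    """Number of distinct unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocking_key(name):
    """Hypothetical cheap key: the first alphanumeric token, lowercased."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return tokens[0] if tokens else ""


def candidate_pairs(names):
    """Only pair up names that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = [
    "Dell Computer", "Dell EMC", "Dell", "Dell Limited",
    "Acme Corp", "Acme Corporation",  # made-up second company
]
pairs = candidate_pairs(names)

print(naive_pair_count(len(names)))     # all-pairs comparisons for 6 names
print(len(pairs))                       # comparisons left after blocking
print(naive_pair_count(1_500_000_000))  # all-pairs count at 1.5 billion records
```

A first-token key is far too crude for real data (it would miss translations and misspellings), but it shows the idea: the expensive similarity function only runs on pairs that a cheap filter considers plausible.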
Brian: Here in the Cambridge, Massachusetts area where I am there’s a company in town called Basis Technology. I’m not sure if you’re familiar. I think they do some of the same work like text analytics and more on the Homeland Security, the immigrations, and border patrol—the software that the agents use when people come into the country and the same issue with multiple languages, anglicization of for example a Russian name that may end in OFF versus OV. It’s this federal law for something like this and knowing that that’s logically the same person. It’s interesting in that space how much clean-up you have to do. Sort of a theme on this show with a lot of the people I talk to is how much work has to be done just to get to the point where you can do some fun stuff and actually solve some problems, because there’s so much data engineering and data clean-up, and just getting to that point where you can do the analysis is hard.
Jim: Absolutely. First of all, I love Basis Technology. I’m friends with Carl, the founder and CEO, but you’re right, it certainly is a similar problem. To your second point, yes, the sort of dark reality of data science or data projects is really spending a large percent of your time dealing with and understanding the data. I mean, one of the big lessons that I’ve had and one of the things that we did not do very well at all in the early days of Panjiva was really deeply understand the semantics and definitions of the dataset before really trying to get value out of it. We kind of just loaded it and just had at it from the computer science standpoint. But we really skipped over one of the most foundational and fundamental precursor tasks to that, I think, which is doing a research project and really understanding: what does the data mean, what do the columns mean, what are the distributions of the data.
The data that we’re using, and this is often the case, was collected for other purposes; it was collected for collecting tariffs as goods cross borders. The data was meant for that purpose and nothing else. They happened to find a resale use case for it but, again as is often the case, the data is not really meant to be nice, clean, normalized, error-checked data. We had to spend a lot of time, and currently spend a lot of time, on every new dataset that we’re adding, and we’re continuing to add data, really understanding the semantics of the data before we even get into it.
We have had a whole team called our data science analysis team. They are sort of former academic researchers, economics-type researchers, who have some technical skills, and they essentially do a deep dive for weeks per dataset to really deeply understand what the data says and means before we start developing a product on top of that data.
Brian: I understand that side of it which is understanding the materials that you’re going to be working with and the information that you have. Maybe this has changed over time but it sounds like you took the data-centric approach in your early days…
Jim: Early days, for sure, yes.
Brian: … and started surfacing, “Let’s get it on the screen. Now, let’s filter. Now let’s have some controls to do things with it.” Has that changed over time? How does that user, this person that’s looking for those irrigation supplies or like, “I want to change the coffee beans”, whoever it may be, fit into that story? If you bring in a new dataset or something like that, maybe it’s in a new domain, maybe it’s new columns or new information that you don’t currently have, is there something that you guys do differently now to make sure that user actually is going to get some value out of it before it goes into the product?
Jim: This is something that you and I have spoken about a decent amount, Brian. I think we really frankly wasted a lot of time doing some kind of science experiment in the beginning phase of the company. I think there’s a place and a time for that, sure. Just kind of getting your head around the data and potential use cases but where we’ve evolved to is a very user-centric mindset about how to actually build and deliver viable products.
I think this is something that, when I’m advising companies or helping friends, I see a lot, and again this is what we did in the early days: it’s very common to see companies that are data source-centric or AI-centric. I really think every company needs to be user-centric and use case-centric, and then needs to have an arsenal of tools from the data science world, from the visualization world, etc., and have those tools to solve the user’s problems and tap into them as needed. Know what the user wants so you can go gather the data that the user cares about and really package the product to deliver on particular use cases.
In the early days, we were a little more of a hammer-looking-for-a-nail kind of scenario. We attack the problem quite differently now, although I would say it’s still not perfect and we still have a number of challenges, but we really try to focus on the use case. We have a whole product team now that we didn’t have before that really deeply understands that, and then we do the technology and data piece after that.
Brian: I imagine you guys probably talk to customers at times to inform this process. Is that right?
Jim: Absolutely. We have a number of different ways to engage with our customers. Some are very high-touch with a small number of data points, and some are low-touch but thicker on the number of data points from the customers. We have a customer success team that is very close to our 20,000 paying users, the closest you can be to 20,000 paying users, obviously. Closer to some more than others. We have an internal system of course to constantly be tracking feedback, that’s one way.
Our product team also has developed a small cohort of users, called our VIP Users, that essentially get early access to beta features and we have in-product feedback mechanisms and phone calls, screen shares, etc, with these users especially for the users who are really engaged and really excited about Panjiva developing product etc. That’s been a fantastic resource. Finally, there are a number of hooks that we have that are constantly measuring data that we’re able to get a very broad view about usage and discoverability, etc. Obviously, we need to interpret that data very carefully, it’s very easy to draw poor conclusions from skewed or thin data. But if you kind of put all these pieces together then we try to get pretty close to the user.
I’ll also say that in addition to customer success and product, which we’ve been talking about, engineering and data science is, I would say, very close to the customer at Panjiva. It’s always been very important to me because I think that when you’re developing a product, the engineers often have to make micro decisions about how exactly to implement a feature, how to lay the groundwork for potential future features, and just to give out accountability of the software. The engineers are going to develop a much better product and frankly be much happier having that connection to the user; it’s much more fulfilling to be building for a person or a set of people that you’ve actually talked to. We sort of touch the user in all of those ways.
Brian: That’s great. It sounds like you have some type of Google Analytics equivalent or internal metrics on page views, maybe a timeline page, stuff like that, but it sounds like you’ve also got some qualitative. I don’t know, is it like a chat window where I can say, “Hey, what does this button mean?” or these types of feedback mechanisms. Is that correct? You have some interface like that where they can send an email right from the interface about issues?
Brian: Got it.
Jim: Exactly. It’s sort of an ever-persistent feedback mechanism and that’s for everybody. And then on top of that, for the VIP users, that beta group, what we do is we essentially build a little extra user interface element to allow those users to type in all their feedback directly. That feedback actually goes to both the project management team, or the subset of that team working on the product, as well as to the specific engineers that have worked on that product. They’re able to get direct feedback. I just think you want to lower the friction as much as possible so that at that moment, when they’re feeling the annoyance or have the inspiration for a way to make it better, you just want that box to be staring them in the face. It’s not even a link that you click and then open the box, it’s just a box that you type in and click. We just want to lower that barrier.
Brian: From having to put these processes into place, is there a story or a particular anecdote that comes to mind? Something that you learned from having these customer touchpoints on a regular basis? Like we never would’ve known X had we not either asked for that passive feedback or maybe in an interview or a screen share session or something. Any particular nugget, story?
Jim: Yes. I’ll give you kind of a broad story, which is a little bit more of a broad learning, and maybe I can give you a specific example as well. But the broader thing is when we started the company, we basically tried to oversimplify and smooth over too many details when it came to distilling this mountain of shipping data into insights. We did that by developing what was called the Panjiva Rating. We’ve actually […] that feature. But we essentially came up with our own metrics to look at the data and boil it all down to a number between 0 and 100 that assessed the “goodness” of a particular supplier.
What we learned from users—again and again and again—was that that just was oversimplifying and not really appreciating the nuance of each particular user’s questions and use case. They were all using the data for a particular use case, and we started the company focusing on the use case of finding vendors or suppliers that we talked about a few minutes ago. Even though we were focused on the use case, the users didn’t have enough fidelity in the data that we were offering with a single number. They often didn’t trust it. They kept saying, “Hey, give us the data.” At that point we were a couple years old as a company, no one had ever heard of us, we just didn’t have the trust from the users.
Frankly, the product that we were offering wasn’t very good. The number didn’t even work that well as a metric. That sort of taught us—and this is sort of by direct customer conversation—that at that point, they were asking for the raw data, so then they got the raw data. We built another piece to give out raw data and then they said, “That’s too much. I need to have a machine learning background to make sense of this.” And the pendulum kind of swung back again and we kind of ended up where we are now, which is somewhere in the middle. But if it wasn’t for sort of getting kicked in the teeth by a lot of customers saying, “This is not what we need to answer our question,” we would not have gotten to where we are today.
Brian: You created this Panjiva Index. I’ve seen this with other products as well. Do you think the issue was really that they didn’t trust it? Was it that the data wasn’t available in the tool for them to unpack the 72 score that they saw for whatever this lead was, or was it there, but they couldn’t understand how you guys packaged it up and made the index? Do you think it was that or was it a lack of belief that you guys could really boil the world down to a single number? It’s kind of like, “This 72, this is supposed to be right for me and all the others that are not really doing exactly what I’m doing, so how much should I really believe in the 72?” I’m sure it’s about that thinking there.
Because theoretically, it’s on the right track. You’re trying to reduce the amount of input required to get some insight, like how much tool time has to be expended in order to get some kind of value from the tool quickly, because your goal is probably not to spend time in Panjiva; the goal is to get some insight out of it. I can understand the desire to go with that tactic. Could you unpack that a little bit more?
Jim: You’re right. Actually, I think it’s a bit of both, but I will unpack it and try to assign some weight to each of these. I really think that the primary problem was that there just wasn’t enough fidelity and there wasn’t enough for the users to really sink their teeth into to actually answer their question. It was on the right track, but it went too far, it was too extreme. To this day, even though we have a brand now and we have a very well-known parent company that lends us a lot of credibility and grace, I still don’t think it’s the right level of fidelity. I think it’s too blunt of an instrument for allowing users to get the insights that we really want them to get. I think that was the primary issue.
I think sort of the extra factor was they also didn’t trust us. I think there’s this idea: “If you want to go in that direction, you want to give users as much insight as possible while limiting the amount of time they’re spending getting that insight.” But I do think there is a point where it’s too far. It could be too far because you’re not allowing the user to fully articulate what they want out of the tool; maybe the user interface, the querying mechanism or the search type of interface, is oversimplified. A lot of people want to emulate Google, for example, but sometimes Google is actually too simple. So I think you can have oversimplification on the input side, and then you can also have oversimplification on the output side in terms of how the data or the insights are ultimately presented.
Brian: Do you guys have any type of internal benchmark use cases or some kind of way of testing the product with users that you repeat over time to see, “Are we still delivering the quality and the value we want to give?” I imagine you’re probably bringing on new data or maybe you’re creating new reports, or the product is changing over time. How do you make sure that you’re not making it worse? Obviously, when we add in information, we potentially add value but we also potentially add noise and friction. That’s where the design tier comes in. I’m just curious if you have some way of real-time monitoring, having some sense of a benchmark where the time it should take to pull up a company, find all of its suppliers, and do X, we want that to take eight minutes or something. Can you talk about that a little bit? Do you have any type of process you guys use for studying that?
Jim: Yeah. I like the benchmark you just talked about. We don’t have anything that tight on the user flow or the UX side of things, but that’s a good idea. I’ll tell you what we do do. We basically compartmentalize the quality assessment. What I mean by compartmentalize is we look at it in a few different key areas. I think it’s actually important and useful to look at it both in compartments as well as holistically.
The first one is on the data side: essentially, the quality of the data science algorithms and the machinery that processes the data. We do that independently, and you can essentially use different techniques to make sure that if the data format changes, we have infrastructure in place to automatically notice that and to make sure that it doesn’t leak into the product. There are often these breakpoints or thresholds that sometimes get triggered, or maybe the machine learning model was trained on one type of data and, ultimately, that model ends up being stale over time. We have so much data and so many different models that we have automated ways of checking that certain quality metrics, sort of standard data science data quality metrics, are upheld. I think the foundation of Panjiva is the data itself and the matching graphs, so we have lots of mechanisms in place: over 5000 automated tests, if you include the data side and the product side, that are constantly running multiple times a day.
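As a rough illustration of the kind of automated check Jim describes, the sketch below flags a batch whose columns have drifted from what downstream code expects, or whose null rate crosses a threshold. It is not Panjiva's infrastructure; the column names, the `check_batch` helper, and the 20% threshold are all assumptions chosen for the example.

```python
# Hypothetical schema for an incoming shipment feed.
EXPECTED_COLUMNS = {"shipper_name", "consignee_name", "arrival_date", "weight_kg"}


def check_batch(rows, expected_columns=EXPECTED_COLUMNS, max_null_rate=0.2):
    """Return a list of human-readable problems found in a batch of dict rows."""
    if not rows:
        return ["batch is empty"]

    problems = []

    # Detect format changes: columns that disappeared or newly appeared.
    seen = set()
    for row in rows:
        seen.update(row.keys())
    missing = expected_columns - seen
    extra = seen - expected_columns
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    # Detect quality regressions: too many empty values in a known column.
    for col in sorted(expected_columns & seen):
        nulls = sum(1 for row in rows if row.get(col) in (None, ""))
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(f"{col}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")

    return problems


good_batch = [
    {"shipper_name": "A", "consignee_name": "B", "arrival_date": "2019-01-01", "weight_kg": 10},
]
# The source renamed "shipper_name" to "shipper" without warning.
drifted_batch = [
    {"shipper": "A", "consignee_name": "B", "arrival_date": "2019-01-01", "weight_kg": 10},
]

good_problems = check_batch(good_batch)
drifted_problems = check_batch(drifted_batch)
print(good_problems)
print(drifted_problems)
```

Run on every load, a check like this turns a silent upstream format change into an explicit failure before the data reaches the product.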
On top of that, there’s performance. That’s just like, “Is the website fast enough?” Especially for a company like ours where the amount of data is only ever increasing, often you have performance needs where all of a sudden, the data that used to fit in memory is now spilling over to disk, or a particular database index is, for example, and then you’ll hit a performance […need]. That will show itself in the product. We have ways of essentially monitoring our key products for our key use cases to make sure that searches are fast enough, and that company profiles and some reports are fast. That’s compartment number two.
Compartment number three is more on the product side. There I would say that the best that we do is kind of the stuff we talked about earlier with vetting a product quality is make sure it doesn’t break. But that’s different than what this sort of holistic flow that you mentioned that I think we should be doing and we’re not.
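Editor's sketch: the first compartment Jim describes, automatically noticing when an incoming data format changes, could look roughly like the Python below. This is only an illustration under stated assumptions. The field names, the 5% threshold, and the helpers `record_problems` and `batch_is_healthy` are all hypothetical; the episode does not describe how Panjiva's own tests are written.

```python
from typing import Any

# Hypothetical expected shape of one incoming shipment record.
EXPECTED_FIELDS = {"shipper": str, "consignee": str, "weight_kg": float}

# Assumed tolerance: flag the batch if more than 5% of records look wrong.
MAX_BAD_FRACTION = 0.05


def record_problems(record: dict[str, Any]) -> list[str]:
    """List the ways one record deviates from the expected format."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing:{name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"type:{name}")
    return problems


def batch_is_healthy(records: list[dict[str, Any]]) -> bool:
    """Return False when too many records in a batch deviate from the format."""
    if not records:
        return False  # an empty feed is itself treated as suspicious here
    bad = sum(1 for r in records if record_problems(r))
    return bad / len(records) <= MAX_BAD_FRACTION
```

A check like this would run before new data reaches the product, so a renamed or retyped field from a third-party source is caught upstream rather than showing up in a customer-facing report.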
Brian: Since it’s a process, the qualitative types of research activities and design is not binary. It’s not like you are or you aren’t. Most people are somewhere on a continuum of the habits and activities and routines that they go through to be customer-focused. Some are doing a lot of stuff, some aren’t, but that benchmarking thing is something that I think helps companies.
With my clients, especially with analytics products, the problem space is indeed a space. It’s usually not just like, “Here, there are five things we need to do and that’s it. If we get that right, then we’ll sell the company and go to an island and party or whatever.” It’s never that simple.
But actually, you could use something along those lines as a benchmark to kind of check yourselves, especially as products grow. That’s what I usually try to encourage clients to do. I mean, it depends on what the problem is in that particular situation. But the idea is that we have to put a stake in the ground somewhere: if we’re going to evaluate the quality of the product from the user experience standpoint, we need to be able to go and run tasks with customers and ask them to perform activities that are realistic based on what their job is, so let’s pick a handful of these to get going. They may change over time, but that beats trying to solve an amorphous “global supply chain” in general, like, “We’re going to solve their problem.” Well, what problem?
At some point, ink is going to go on a screen, buttons will be created, workflows will happen, and that can either be a very deliberate process or you can fall into it. My thing is like, “Well, let’s try to model it around problems that people actually have, pick a subset of those,” and then over time you’ll probably learn whether those are even the right benchmarks or not, but it will at least guide the process and keep it from being a mediocre product for everybody, so that you actually have a really great product for a smaller set of people. That might mean some users have to suffer a little bit. They’re not going to get the A+ experience. They may get a B- or a C+ experience, but you’ve decided that, “Hey, we’re not going to sacrifice the jolly Jane persona or the supply chain Eddie.” You come up with these people that are kind of your models, the ones you really need to satisfy the most, and you say, “We’re not going to put that guy’s job at risk because they’re our top client, they’re our top customer, they’re the ones that actually spend two hours per day in the product. They’re not the people that check in once a month to download this report. It’s okay if that reporting interface isn’t as great. Let’s not screw with the features and the tool that we know that Eddie is using two hours a day every morning when he gets his coffee and he sits down. The first thing he does is open email and he opens Panjiva and does X, Y, Z.”
Jim: Exactly. It’s so easy to get overwhelmed with the degrees of freedom that you have when starting a company and thinking about a new product. One of the things we struggled with, in the early days especially, is the blessing and curse of Panjiva: we have so many different types of users and use cases. It’s actually about 10 if you enumerate them, but only a few are the key ones. Eventually where we landed is we have mechanisms for all 10 of those use cases to get value out of the product, but we really try to nail the top three use cases, and that means special flows, special nomenclature, etc. for these particular users.
Brian: I’m curious. I don’t know if you call your company an engineering-driven company or not, but for ones that are, do you get the people that are always wearing the exception hat? I could see in a space like yours where there’s always going to be someone that’s going to say, “Yeah, but someone might want X,” or, “We talked to that one guy that wants to do it like Y.” How do you solve that tension? Do you guys run into this problem where maybe a squeaky wheel, an early customer who’s been with you for a long time, has a really strong feeling about something, and I’m sure you guys have some internal debates about, “Are we going to satisfy this guy or gal with what they’re asking for?” or “Nope, we’re not going to go there because of this.” How do you handle the competing requests?
Jim: As an engineering-driven company, it’s fun to build things. You have folks join the company because they like the entrepreneurial freedom they have, to talk to customers and to develop products. Actually, right now, as we are speaking, there’s a hackathon going on and everyone’s working on projects that they came up with on their own. It could be product features, it could be data science features, but only the ones that are truly deemed to be valuable to customers are actually going to be worked on for real after the hackathon.
Back to your question, I would say in the beginning, it’s very exciting to have any customer that will listen to you. At the beginning of any company, you’re very excited to build something that does an okay job at meeting their needs and solving their pain. I think we hastily built products in the early days and oversimplified the needs of the customer and the ultimate product that we developed. But that’s okay because we moved fast and we built the next one as we learned.
I think there’s a couple of ways the tension that you’re describing manifests itself. One is, as you described, a squeaky-wheel customer that gets very upset. This manifests itself especially perniciously when a salesperson is on the line with a customer that is about to close, or they’re already a customer and they’re not sure if they want to renew or not, and you have to make a call. Are we going to keep this feature? Are we going to build this feature for this one potential customer who’s somewhat valuable but it’s not a crazy amount of value? That comes up all the time and that’s just the top judgment call that we constantly have to manage.
Ideally, you have a strategic direction for the product. It’s easier to make the call to build if that thing is in line with where you’re going anyway, of course, but it gets really hard when you’re trying to hit your monthly or quarterly revenue goal and this thing is a little bit out there. I think it’s great to have the freedom and the discipline to say no.
There’s another way that manifests itself beyond customers, though, which is just internal. There’s a mix of personalities in any group of people, and it’s very common for a more extroverted or strong-minded team member (it could be another engineer, a salesperson, or a business developer) to end up with an oversized share of sway in a company. It gets really important to acknowledge that natural distribution of different personality types within any team, to pull out of different team members the opinions that they have, and then to take more of a cross-cutting look at what we are hearing in general from a broader set of customers, a broader set of people.
The final thing I’ll say is there’s another bias, which is recency bias. What is very common with the way we’ve operated, and has worked pretty well, is that we did our major planning of larger projects on kind of a quarterly basis, obviously re-evaluating every couple of weeks with our scrum-style iterations. But it’s very common and very easy to unintentionally prioritize things that happened to have come up within a couple of weeks of the planning meeting, which may be around the beginning or end of the quarter. It’s important to try to take a little bit more of a longitudinal view of the feedback we’ve been getting over time and to document that, so that when you do do the planning, you’re smoothing over the recency bias, you’re smoothing over the strong personalities and the irate customers, and just trying to do what’s best for the company more in the medium and the long term.
Brian: I know that you have recently given a talk at a conference on this. If I recall, the title of the talk was Three Things To Get Right In Data Science. Is that correct? Could you share with us a brief version of what those things were? If that’s not quite the right title, tell us what it was?
Jim: It’s Avoiding Data Disillusionment: Three Things To Get Right When Building Data Products. First, a little bit of a preamble. There’s a reason I think we are headed towards disillusionment, or at least a lot of data science projects are. There’s just a ton of hype and excitement around data, data science, machine learning, AI, and I think for good reason. We have crossed a pretty cool threshold where a lot of value can be created now because we have this nice combination of data availability, strong algorithms, and compute power. That combination is certainly powerful. But there are a lot of folks out there with a data set hunting for a problem to solve who aren’t necessarily going about it with a user-centric and use-case focus.
Gartner puts out hype cycles for different technologies. If you look at the hype cycle that was put out a couple of months ago by Gartner for data science in general, pretty much all of the technologies, except for a few, are in that peak. If history repeats itself for data science like it did for enterprise software, internet technologies, mobile technology, et cetera, a lot of folks are going to be disappointed over the next 3-5 years.
The talk is all about the decade that we’ve spent building Panjiva and a few of the key learnings. I think, for me at least, if you apply these learnings, it will reduce the risk of disappointment. The three areas to get right: one is deliver what’s valuable for the user, two is demystify the technology, and three is democratize the data science talent.
On the first one, this is so obvious and sounds […contrite]. It almost feels silly to say, but we talked about this a lot in this particular conversation. I just think this is worth repeating and worth keeping front and center all day, every day. Just really focus on delivering something that’s valuable to the user no matter what.
When we started Panjiva, we actually were not an AI company focusing on shipping data. We were actually focused on deobfuscating the supply chain in a very different way, which was building more of a platform for ratings with subjective reviews, helping companies learn about other companies in faraway places like China or India through the reviews from other people. We thought that was going to work. We saw a lot of inspirational examples. After about six months, we learned that that was just not going to work. A lot of people do not want to share their good suppliers, and people want to tarnish their competitors. There are a lot of incentives in play that made that business model not viable.
But in the process of building that product, we stumbled upon some data that we were planning on using to assess the veracity of the ratings themselves. That data was shipping data. We were planning on using the shipping data to figure out if the ratings were coming from real customers. We would get a rating and look up, with the data in the background, kind of manually: is this a real customer of this giant manufacturer?
In the process, we realized that this rating thing was not working but the data was actually really interesting. At first, we thought there was just too much to get any value out of, but looking at it from a machine learning and data science angle, we realized that if we worked really hard at this, we could actually turn it into something valuable.
This is one thing we got right (there was a lot of luck here): all along we were focused on helping the user get insight about these companies around the world at a distance and developing trust at a distance, which is an age-old problem. No matter if it was a platform business or a rating business or an AI data business, we were focused on solving that problem. That’s the first lesson: deliver what’s valuable to the user.
The next one is demystify the technology. We talked about this to some degree earlier with the Panjiva rating. We were trying to have this magical black box that took all this data, which could be tens of thousands or hundreds of thousands of shipments associated with a given company, and boiled all of that down to one number. In a way, we were wrapping up all of this technology behind this black box. As we talked about before, users just didn’t trust it, they didn’t understand it, and they couldn’t contextualize it with their particular business problem. Demystifying the technology is really important.
This is coming into play a lot in data science and AI in particular now, where the reality is the models are quite complicated and quite sophisticated. But that doesn’t mean that we could just let it be a black box and spit out an answer. I think it’s so important to wrap that technology and give users hooks into the technology so that they can, a: trust it, and, b: take the insight, contextualize it for their particular problem or their particular business use case, and make it as user-friendly as possible. You have to demystify technology.
Three is democratize the data science talent. This is a little bit more of a tactical approach that is just necessary, given the scarcity of data science talent in the world today. I don’t know the statistics but there are more job openings for [data scientists than data scientists] out there and that’s going to be the case for quite some time. Carnegie Mellon just came out with the first undergraduate major in AI but it does take some time.
I think it goes beyond that. I think this is beyond just leveraging the scarce data science talent to actually get products built. I think the reason that it’s important to democratize the data science talent is also because helping product managers, other engineers, and even folks doing business development understand the capabilities of data science is going to fundamentally shape how they think about developing a product and mapping the user’s need into the technology domain.
Thinking about building a model, thinking about features of the data, developing a training set, assessing error rates, and communicating how the black box works is really a fundamentally different approach to developing products. I think it’s really important to educate the broader team in the broad strokes of data science so that they understand how to leverage the tool even if they don’t know how to write the code or think about it from a mathematical perspective. The third thing to get right when building data products is democratizing the data science talent.
Brian: In terms of the second one, in terms of demystifying the tech, I’m curious how you see that line between either [user sit down and I have some task or job] that I need to perform. With analytics, it’s usually some kind of decision support that I’m looking to get or in near could be a lead, something like that. How much do I need to understand how Panjiva generated the response that I got? Where’s that line between, “Woah, I just want to go to the grocery store. I don’t need to know on the screen of my car, ‘The fuel is now being injected into the whatever,’ and you can see every part of how the engine is going to move the wheels, etc.”? Where’s that line between noise and not needing to really understand all of it, versus, it sounds like, maybe you’re saying you need to expose enough to get some trust? Is it about building the trust that’s important and then exposing the right amount of the magic sauce? Can you unpack that a little bit?
Jim: Yeah. There’s a couple of aspects here. For the actual form the product can take that provides a nice, happy medium, not overwhelming the user but also giving them the hooks they need to actually demystify the technology, I’ll use the onion analogy. You give them a high-level view of the insight and ideally package the product or the insight in language that maps to the way they think about the problem. Make it very simple at the outset but provide a set of drill-in mechanisms to actually go deeper if they want to. Start with the simple thing but don’t stop there from a product standpoint.
I’ll say that that is a tactic for building the product itself. I would say that the goals, though, are two-fold. One is to build trust, as you said, and the second is to provide context for the particular use case. In Panjiva, there are many use cases, and different use cases may need more or less fidelity and nuance, and maybe different fields of data or types of customization. The ideal solution is to build a specialized product for every use case, where the user just goes in and gets exactly what they need for their use case. And maybe an advanced user can peel the onion a bit and go a little bit deeper if they need to, if they want to really understand it. That’s the ideal scenario.
In our case, we kind of have a hybrid solution where a few of our use cases have that level of tightness, where we’re mapping product to use case or use case to product. But we have sort of a long tail of use cases, and for those, we provide a little bit more of a generic advanced interface, give folks the onion approach where they’re drilling in if they need to, and give them enough hooks into the data so that they can understand how to map this insight into the particular problem that they’re solving, be it optimizing their shipping lane or finding out information about their competitor.
Brian: I think you outlined sort of a framework for products that have discrete conclusions or insights, where you know there’s going to be a repetitive need to go in and get answer A for question B, as it may be. I’ve seen that framework for design work well, where you’re doing what I would call ‘presenting a conclusion first, not the evidence’: you provide, “The answer is it’s the 72 index,” if you’re going to use something like an index, for example, but then you need to provide the right amount of supporting evidence to back that up.
One thing I’ve seen work well in this space, too, is sometimes that includes information on what you didn’t do, like, “What information did we not include”? Or, “Hey, our supply chain data is a year old, so what we’re giving you is actually from 2017 not 2018,” or, “We did not adjust for inflation,” or, “We did not do X.”
It’s not about listing out everything that you didn’t do because that could be a mile long, but as you get to know your customer and the questions that may be going through their head, which you only can learn by talking to them, you may be able to answer that in the interface explicitly on the “evidence page.” It may not be a single page but the place where you back up some of those analytical conclusions.
You might give them an idea about, “Here’s what we checked, here’s what we looked at, we cross-referenced it with this, but we did not do these things,” and then they can start to build trust because they have a little bit more of an idea of how the sauce works.
I’ve even seen it get to the point where they may stop looking at that and start to really trust the conclusions because they know what goes into the recipe and they don’t need to know the recipe anymore, they just want the pizza, like, “I know it’s good, I know what kind of flour you use, I was interested the first time because it’s so good but now it’s just…”
Jim: Now I just eat it.
Jim: Absolutely. We actually saw that exact pattern play out at Panjiva, where a lot of users in their first couple of months will use the more advanced interfaces, where one exports the data to Excel, double-checks the math doing pivot tables, etc. As time goes on, they’re just using the prepackaged reports because hopefully, they’re better than the manual analysis over time because we kind of consider these exceptional data cases.
Any data product that’s done well, or any non-trivial data product that’s done well, is sometimes going to have dozens of these things. That’s true whether you’re mining first-party data, like Palantir, Tamr, and other companies that are essentially taking companies’ internal data and working with it, or, as in our case and many other companies’ cases, taking third-party data. No matter where you’re getting your data, there’s going to be issues with it, there’s going to be delays, format changes, granularity differences. It can get overwhelming to the user, too, to essentially list out every contingency, every deficiency. I think there’s a real balance here, a balancing act.
As much as possible, we try to use the tools of data science to actually correct the data deficiency, or impute, or use whatever technique is actually going to be better than nothing, but then say this was imputed, or this is reported versus imputed. I think that can lean on some common design language in the product, basically, and then over time the user starts to understand that if it’s gray italics, that was imputed, and if it’s black regular text, that’s reported data, for example. I think that can help the user just sort of intuitively grasp, “Okay, I need to be a little bit more careful with this, but I [..get the gist.],” depending on their use case. That’s an example of communicating just enough to serve the users who need that level of fidelity, or maybe a tool tip can say “imputed” and give the methodology if they need that, or if they just [need the broad stroke to get the product …], they don’t take in.
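Editor's sketch: one way to carry the "reported versus imputed" distinction from the data layer through to the interface is to store a flag next to every value. The Python below is a minimal illustration under stated assumptions, not Panjiva's implementation. Median imputation is swapped in only because it is simple, and the `Value` type, `fill_missing`, `css_class`, and the class names are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Value:
    amount: float
    imputed: bool  # True when filled in rather than reported by the source


def fill_missing(raw: list[Optional[float]]) -> list[Value]:
    """Replace missing entries with the median of the reported ones, keeping a flag."""
    reported = [x for x in raw if x is not None]
    if not reported:
        raise ValueError("nothing reported, so nothing to impute from")
    fallback = median(reported)  # assumed technique, chosen for simplicity
    return [Value(fallback, True) if x is None else Value(x, False) for x in raw]


def css_class(value: Value) -> str:
    """Hypothetical class names a UI could style as gray italics vs. black regular text."""
    return "value-imputed" if value.imputed else "value-reported"
```

Because the flag travels with the value, the interface can apply the same visual convention everywhere, and a tool tip can explain the methodology only to the users who ask for it.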
Brian: I like that idea of the design, if you can build those kinds of things in and, again, minimize them over time. You don’t have to hit people over the head with it. It’s knowing the questions that your customer might ask, the ones that you might need to fill in. It’s not every deficiency. It’s just the ones that may be a friction point for them, like, “Hey, am I really going to pull the trigger based on this insight?” Is there some answer you can give them or some information that you can give them to make them feel comfortable with the decision support that the tool is generating? Like in your case using imputed data, for example, I think that’s good, especially if we can balance the subtlety in the interface there such that you’re not adding additional noise as well. I like that thinking a lot.
Last question. We’re getting close to the clock here on our time. I’m curious to know without getting into the hype of data science and all that, but is there something that you’re excited about in your space that you think, because of the climate we’re in, with the compute power being available, you guys obviously are dealing with a ton of information, you’ve cleaned a lot of it, is there a new place you guys can go with this technology in terms of simplifying the experience for your customers? Whether it’s a new feature or, “Hey, we’re going to be able to cut out a whole section of the product because this technology is going to allow us to do X now and we could never do that before.” Just curious. What’s the next journey look like? Is there something new that’s going to be enabled or is it more of a slow crawl and the toolsets get better along the way? It’s still going to be a house but we don’t erect the walls the same way. They may not see how we erect the walls but for us internally it’s easier. Can you just kind of speak openly to that?
Jim: I’ll go back to the user again. I think the really, really tough nut to crack when building a product is finding the product-market fit and really finding the key nugget or nuggets that are going to answer questions and solve pain for your user. All the stuff that we’re working with, data, different data science frameworks, et cetera, these are great tools and these tools are getting better. They’re essentially accelerating the pace of development, but I don’t actually think it’s unlocking that much, save for a few key use cases like autonomous vehicles and some other use cases that really were very difficult before but now are actually possible.
For those use cases, I think the fundamental challenge that we all face is really just how do we use this ever increasing and improving arsenal of tools that we have at our disposal from AI, data visualization, and really fast parallel analytic capabilities on the backend? How do we cobble together a product that really meets the user’s needs? I see that as we gain more inches, not feet. It’s going to be such a fundamental problem, I think it’s not going to go away, maybe ever.
Brian: You hit on a topic I’ve talked about on this show many times before. It’s not a magic bullet. All these technologies, machine learning, faster computing power, whatever it may be, most of these things are not a case of just take the pill, swallow it, and then bam, instant new business value. Now, run out and find a place to go use this new tool? That’s not necessarily going to save you.
You’ve really got to understand the problem space and know how to deploy the technology properly because at the end of the day, they’re still going to log into this interface that’s running in a browser, they’re going to pull up some information, somehow they’re going to receive pixels and ink on a screen telling them something. How you guys did all that in the backend and the technology that went into it, that may change over time, it may get better, it could be accuracy-improving, but it’s not usually a magic bullet. You kind of reiterated that. Typically, I hear it. I think it’s important, with all the hype that’s going on, that it’s not like, “Oh my God, we don’t even need an interface anymore.”
Jim: To be fair, I am a technologist at heart, I love this stuff, and I think it’s super exciting. I just think that it’s very easy to get wrapped up in the technology and have that overshadow the most important thing we’re here for.
Brian: Cool. This has been super fun. We’ve been talking to Jim Psota from Panjiva. It’s now part of S&P. Can you tell people where to find you online? Are you on Twitter or LinkedIn? Is there a place where people can learn more about you and the company?
Jim: Yes. The website is panjiva.com. You can find me on Twitter @jimpsota and drop me a line on LinkedIn as well.
Brian: Great. I will put the links in the show notes. It’s been great to talk to you again. Again, congratulations on the acquisition and props from Fast Company about data science, that’s really great. Hope we get the chance to talk again soon.
Jim: Always a pleasure, Brian. Thank you.