Thanks for being here!
Announcement(s):
#1 - Our transaction data is now available at a more granular “row level”.
#2 - We now have single-merchant gift card data available. Yes, really!
Contact me to learn more.
The theme that emerged in this week’s email: synthetic data is getting a lot of attention these days.
QUOTES
“Gartner has estimated that 60% of the data used in artificial intelligence and analytics projects will be synthetically generated by 2024. Synthetic data offers numerous value propositions for enterprises, including its ability to fill gaps in real-world data sets and replace historical data that’s obsolete or otherwise no longer useful.”
“Synthetic data — which resembles real data sets but doesn’t compromise privacy — allows companies to share data and create algorithms more easily.”
News Articles
Podcasts
Cool Charts
Final Thoughts (Where is the opportunity in data?)
#1 – Jonathan Chin from Facteus asked ChatGPT some pretty pointed questions on the Alternative Data space. February 2023.
My Take: This was a fun & interesting way to open a discussion about how ChatGPT might impact the alternative data space. Of most interest to me was the commentary about the next 5 years (data privacy focus … see synthetic data articles highlighted this week) & then how ChatGPT might integrate into the Alt Data space, namely the program’s ability to analyze & make sense of large quantities of data … and then tell a good story. That last part remains a challenge.
#2 – Brian Eastwood of MIT’s Sloan School published What is synthetic data — and how can it help you competitively?. January 2023.
My Take: Bottom line, synthetic data solves many of the privacy issues that face data practitioners. Studies have shown that there is “no significant difference” between predictive models using synthetic data relative to those models using real data. There will always be “reidentification” risks, but frankly, most use cases in the investment space have no need for any specific individual’s information. Challenges using synthetic data include missing outliers, questions on quality (justified or not), user acceptance (justified or not), extra time, cost, & effort…among others. Further reading on synthetic data from Cem Dilmegani highlighted below (charts section).
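For anyone who wants to see the basic idea in code, here is a toy sketch: fit a simple statistical model to the real data, then sample brand-new rows from that model. It rests on deliberately simplified assumptions (one column, a normal distribution) and is not the method described in the MIT article; the `make_synthetic` function and the purchase amounts are made up for illustration.

```python
import random
import statistics


def make_synthetic(real_values, n, seed=0):
    """Sample n synthetic values from a normal distribution fitted to real_values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" purchase amounts (hypothetical data, for illustration only).
real = [22.5, 18.0, 31.2, 27.8, 19.9, 24.1, 29.6, 21.3, 26.7, 23.4]
synthetic = make_synthetic(real, n=1000)

# Compare the aggregate picture: real average vs. synthetic average.
print("real mean:     ", round(statistics.mean(real), 2))
print("synthetic mean:", round(statistics.mean(synthetic), 2))
```

The synthetic average should land close to the real one, yet none of the synthetic rows is an actual customer record. A fit this simple will also tend to smooth over rare outliers, which is one of the challenges noted above.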
#3 – Matthew Bernath published The Power of Data Ecosystems. February 2023.
My Take: Matthew highlights the 5 reasons why data sharing is so powerful: 1- Improved decision making, 2- Enhanced customer experience, 3- Increased efficiency, 4- Better collaboration, 5- Enhanced trust. It all makes sense! Execution is the hard part!
BONUS: Related to all this talk of synthetic data & data privacy: Snowflake acquired LeapYear.
“Differential privacy is essentially a term used to describe systems that allow users to share information about a data set by describing patterns within the data set, without sharing any individual or personally identifiable information from the data set.”
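To make that definition a bit more concrete, here is a minimal sketch of one common way to get the property: add random noise to an aggregate before sharing it (often called the Laplace mechanism). This is a toy illustration, not how Snowflake or LeapYear implement it; the `private_count` function, the `epsilon` value, and the spend figures are all assumptions made up for the example.

```python
import random


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that match predicate.

    Adding or removing one person changes a count by at most 1, so
    Laplace noise with scale 1/epsilon is added to the true count.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponential draws with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Made-up monthly spend per customer (hypothetical data).
spend = [120.0, 35.5, 240.0, 89.9, 15.0, 310.2, 101.0, 64.3]
rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Shareable pattern: roughly how many customers spent over $100?
noisy = private_count(spend, lambda s: s > 100, epsilon=0.5, rng=rng)
print(round(noisy, 2))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means a more accurate answer. Either way, the recipient sees a pattern in the data set rather than any one customer’s row.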
BONUS 2: Great title from Benn Stancil: The insight industrial complex. February 2023.
“Insights are grandiose enough to sound valuable, and amorphous enough to avoid actually saying what that value is. It promises buyers the world without promising anything at all.”
#1 – My First Million podcast published Data-Based Businesses, Executive Briefings, and Spotting Scams with Anand Sanwal. January 2023. (h/t Jason P for the recommendation)
My Take: I really like “idea people” … folks that are just idea machines … add in the ability to execute on the idea & this is Anand Sanwal. He has built CB Insights into a $100m revenue business. A couple of things were of most interest to me. This idea that most companies are not good at monetizing their own “sawdust data” (I call this “exhaust data”). This is true: I deal with a lot of companies that want to explore data monetization, but most are unwilling to do the hard work of getting the data ready to go.
I appreciate the framework Anand provided about finding data businesses that have barriers to entry. Seek tiny but growing niches. Framework … if it meets these three, you can build a moat around it:
High consideration (someone is putting a lot of money out there)
Opaque (hard to find data)
Variable SKU (can’t just put it into a table)
Highlights (63-minute run time):
Minute 01:15 – Anand’s background
Minute 03:30 – discussion of CB Insights being a good business model
Minute 04:00 – “Sawdust data” with examples (like “exhaust data”)
Minute 06:00 – how to make messy data into clean data & sell it; framework
Minute 08:00 – Seek tiny but growing niches; “representation media” (spitballing session)
Minute 16:30 – Build once, sell multiple times … these are good businesses
Minute 18:20 – “In God we trust, all others must bring data”
Minute 21:45 – discussion of pricing
Minute 24:00 – college ranking … “broke & busted”
Minute 28:30 – having access to private data is going to be increasingly valuable (esp w/ ChatGPT)
Minute 29:30 – you want the data that is hard to get & changes frequently
Minute 33:30 – Anand is (foolishly or brilliantly) heavily long CB Insights
Minute 35:15 – Anand didn’t want to work for his investors; discussion of conservative growth model
Minute 37:30 – education is broken; how Anand would fix
Minute 41:00 – how Anand ran and sold a company in India
Minute 47:30 – discussion of Adani Group & spotting frauds (red flags are convoluted org structures)
Source: Cem Dilmegani authored What is Synthetic Data? Use Cases & Benefits in 2023. Originally written 2018, updated January 2023.
Source: My brain.
Let’s talk about data.
I have been thinking about the evolution of our “data industry”. It is frenetic. Huge opportunity for those who have foresight & are nimble enough to adjust quickly. With the struggle to organize data, the grind of long sales cycles & the changing compliance ecosystem, it can be exhausting. But we are clearly seeing some demonstrable benefits from “doing data well”, and with that … more prospective clients are trying to figure it out. We’ll get there.
So where is the opportunity in data?
Collection & Storage (2015-2020)
Vastly more data being created
Advances in cloud to outsource storage … major step function easier & cheaper than previous solutions (essentially in-house)
Where does value flow?
o AWS, GCP, Azure … others
Organizing & Manipulating (2020-2025)
How to process and organize all this data?
What are the use cases and how do data users unlock value?
Where does value flow?
o Snowflake, Databricks, & tools for ingesting, organizing, testing, QA-ing, privatizing, ETL, ELT, and other acronyms.
Owning (2025-2030)
Now that we are better at storing the data, organizing the data, and manipulating the data … the value will be owning the data inputs.
Where does the value flow?
o 90 West-like firms 😊, owners of first party data, large corporations that recognize market value of their own “exhaust data” … maybe even the individual person will see some direct monetary benefit (certainly there will be indirect benefit for the individual).
Potential fly in the ointment is synthetic data (theme this week) … perhaps all analysis will be done with fake data. I am skeptical, but then again, I am a data owner.
I’ve been toying with this analogy (feedback welcome):
Early movies were just traditional theater placed on a screen … it took years before better movies were created and TV was in every household.
Early internet was just traditional marketing docs put on a computer screen … it took years before tools improved and killer apps like the iPhone unleashed the true power.
Early data practitioners just store and struggle to organize rows & columns (like we used to do & still do in Excel) … we are in the midst of major changes that will revolutionize how we engage with data. Like all good tech, it will be like magic … what will be the killer app?