Thanks for being here!
Announcement(s):
Check out VK’s updated Core Collection List. Our Core Collection List contains hundreds of actionable, targeted datasets covering a broad range of industries, sectors, and geographies.
Vertical Knowledge will have a presence at two upcoming conferences:
- Eagle Alpha’s January 18th data conference in NY
- BattleFin’s January 24-26th data conference in Miami
Let me know if you plan to be in attendance.
The theme that emerged in this week’s email: “Good” data is an essential input to AI models.
QUOTES
“With reduced barriers to building AI applications on LLMs, data is arguably the most important currency in building a differentiated position.” – Christine Kim, Greylock
News Articles
Podcasts
Cool Charts
Final Thoughts (Barbie)
#1 – Christine Kim of Greylock published Vertical AI: Why a Vertical Approach is Key to Building Enduring AI Applications. December 2023.
My Take: I really enjoyed this thoughtful article. AI is providing us with a huge opportunity (investment, societal, productivity). I have struggled to articulate that opportunity, and I think this article does as good a job as any. The role data will play is central to any AI-related business. The first businesses created will focus narrowly on disrupting a very specific market where AI-powered products allow 10x efficiency (e.g., basic legal work and basic analysis of financial reports will become much more efficient). “Good” data sits at the core of many of these new business models.
#2 – Maggie Harrison published AI Loses Its Mind After Being Trained on AI-Generated Data. July 2023.
My Take: This article refers to a Rice University report, cited here. It relates to our theme of the week: “good” data is essential as an AI input. AI-created data is not considered “good data”; only originally created data is (otherwise the models “go MAD”, i.e. Model Autophagy Disorder). The summary of the report’s conclusion is that “either the quality (precision) or the diversity (recall) of the generative models decreases over generations”. There is a compounding feature to this as well.
Both The Conversation & Appen published similar articles, Researchers warn we could run out of data to train AI by 2026. What then?, November 2023, & The Impending Data Crisis in the AI Economy, December 2023.
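To make the "decreases over generations" idea concrete, here is a toy sketch. It is not the Rice University experiment (which studied generative models); it assumes a much simpler stand-in in which a normal distribution is refit to its own samples again and again. The function name, sample size, and number of generations are all hypothetical choices for illustration.

```python
import random
import statistics


def simulate_self_training(generations=30, sample_size=20, seed=0):
    """Refit a normal distribution to its own samples, generation after generation.

    Returns the fitted spread (standard deviation) at each generation,
    starting with the original "real" data at index 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for originally created data
    spreads = [std]
    for _ in range(generations):
        # Each new "model" only ever sees data produced by the previous one.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate of spread
        spreads.append(std)
    return spreads


spreads = simulate_self_training()
print(f"spread at generation 0: {spreads[0]:.3f}")
print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3f}")
```

In this toy setup the fitted spread tends to drift downward over many generations, a rough analogue of the loss of diversity the report describes. Individual runs vary with the seed and sample size.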
#3 – Ethan Mollick’s An Opinionated Guide to Which AI to Use: ChatGPT Anniversary Edition, A simple answer, and then a less simple one. December 2023.
My Take: If, even after reading all the doomsday ‘run out of data’ articles above, you still want to use AI, Ethan offers us some suggestions. The easy answer is to use Bing (yes, that Bing). The less simple answer is, well, less simple, and I’d suggest reading the article. My not-so-helpful summary is that it depends on your use case and the amount of time and effort you want to put into it. In any case, the models you are using today will be the worst you will ever use, as whatever comes out tomorrow will be much better.
BONUS: Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning. Oct 2022. “So far we have found that data stocks grow much slower than training dataset sizes (see Figures 3c, 4c, and 5c). This means that exhausting our data stocks is inevitable if current trends continue. In addition, the high-quality data stock is much smaller than the low-quality stock.”
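The quoted argument boils down to simple compounding: if training datasets grow faster than the stock of available data, the two curves must eventually cross. Below is a minimal sketch of that arithmetic. It assumes constant annual growth rates for both, and the function name and example inputs are hypothetical, not figures from the paper.

```python
import math


def years_until_exhaustion(stock, dataset, stock_growth, dataset_growth):
    """Years until a growing training dataset matches the available data stock.

    stock, dataset: current sizes in the same unit (e.g. tokens).
    stock_growth, dataset_growth: annual growth rates (0.5 means +50% per year).
    Solves dataset * (1 + dataset_growth)**t == stock * (1 + stock_growth)**t for t.
    """
    if dataset >= stock:
        return 0.0  # already using everything that is available
    if dataset_growth <= stock_growth:
        raise ValueError("the dataset never catches up with the stock")
    return math.log(stock / dataset) / math.log(
        (1 + dataset_growth) / (1 + stock_growth)
    )


# Hypothetical inputs: the stock is 100x today's dataset and grows 7% per year,
# while dataset sizes grow 50% per year.
print(round(years_until_exhaustion(1000, 10, 0.07, 0.50), 1))
```

The point of the sketch is the shape of the result rather than any specific year: a large head start for the data stock buys surprisingly little time when dataset sizes compound faster.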
What else I am reading:
Seth Rosenberg’s Product-Led AI. September 2023.
Tam Harbert’s Tapping the power of unstructured data. Feb 2021.
Herb Greenberg’s Special Report – Backdoor Play on AI. December 2023 (via Matt Ober).
Doug Laney’s How Well Is Data Fueling Your Company’s Digital Revolution?. December 2023.
McKinsey’s 2023: The Year in Charts.
Source: Deconstructing Data Podcast interviews Bright Data CMO Yanay Sela. November 2023.
My Take: This was an interesting conversation about the power of publicly available web data as it is used for marketing. Yanay sees this data as particularly valuable when combined with internal data.
Public web data is “any data that is on the internet and is public by design”. Is there public web data out there that informs the decision I am making? Almost everyone would answer that question in the positive, and it really highlights the importance of asking the right questions (Sam Walton example, Minute 20:00).
In the past 2 years, Yanay has really seen how web data changes the decisions of marketers (16:40); more data changes the way you ask questions.
20:45 - “today it is about asking the right questions”
Highlights (45-minute run time):
Minute 02:00 – interview starts and background
Minute 03:30 – Bright Data overview & background
Minute 05:00 – AI and marketing; Google Ads; PMax
Minute 08:00 – discussion of Google Performance Max (PMax); the importance of scale
Minute 12:00 – focus on ROI of ad spend (Google created an advertising black box)
Minute 14:45 – what is public data and how are marketers using it?
Minute 21:00 – what about AI and not having a reference point? Source attribution?
Minute 24:30 – AI models will compete against each other
Minute 26:00 – example of zip code analysis & attribution
Minute 28:15 – importance of articulating the right questions
Minute 30:40 – public web data & AI are best friends
Minute 33:40 – will public data remain in public?
Minute 37:00 – Reddit example of limiting data access
Source: Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning. October 2022.
Authors: Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, Anson Ho.
Running out of “high quality” data around 2024:
Running out of “low quality” data around 2040:
BONUS: Via Herb Greenberg’s Special Report – Backdoor Play on AI. December 2023. (Will “AI-washing” be the new “green-washing”?)
BONUS 2: The Data Engineering Lifecycle. Source: Ken Zockoll
Barbie Movie drives interest in Barbie products.
This is an example of how Vertical Knowledge’s Bestseller data can help you track brand interest. The chart below shows the number of Barbie-related products listed among the Top 100 bestsellers within the Toys & Games > Dolls & Accessories category on Amazon.
We track 1,700+ categories.
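For readers who want to try this kind of brand tracking themselves, the count behind a chart like this can be sketched in a few lines. This is only an illustration: the data layout (a dated list of product titles in rank order), the function name, and the sample titles are all hypothetical, and it is not a description of how Vertical Knowledge builds its Bestseller data.

```python
def count_brand_listings(snapshots, keyword="barbie", top_n=100):
    """Count products whose title mentions `keyword` in each dated snapshot.

    snapshots: dict mapping a date string to product titles in rank order.
    Only the first `top_n` titles of each snapshot are considered.
    """
    counts = {}
    for date, ranked_titles in snapshots.items():
        top = ranked_titles[:top_n]
        # Case-insensitive substring match on the product title.
        counts[date] = sum(keyword.lower() in title.lower() for title in top)
    return counts


# Hypothetical sample data: two bestseller snapshots for one category.
snapshots = {
    "2023-06-01": ["Barbie Dreamhouse", "LOL Surprise Doll", "Baby Alive"],
    "2023-08-01": ["Barbie The Movie Doll", "Barbie Dreamhouse", "Baby Alive"],
}
print(count_brand_listings(snapshots))
```

A simple substring match will miss misspellings and catch unrelated titles that happen to contain the keyword, so a real pipeline would likely need brand metadata or manual review on top of this.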