Nightshade, the tool that ‘poisons’ data, gives artists a fighting chance against AI

A "poisoned" image of a cow with wheels instead of legs.

Image Credits: Bryce Durbin/TechCrunch

Intentionally poisoning someone else is never morally right. But if someone in the office keeps swiping your lunch, wouldn’t you resort to petty vengeance?

For artists, protecting work from being used to train AI models without consent is an uphill battle. Opt-out requests and do-not-scrape codes rely on AI companies to engage in good faith, but those motivated by profit over privacy can easily disregard such measures. Sequestering themselves offline isn’t an option for most artists, who rely on social media exposure for commissions and other work opportunities. 

Nightshade, a project from the University of Chicago, gives artists some recourse by “poisoning” image data, rendering it useless or disruptive to AI model training. Ben Zhao, a computer science professor who led the project, compared Nightshade to “putting hot sauce in your lunch so it doesn’t get stolen from the workplace fridge.” 

“We’re showing the fact that generative models in general, no pun intended, are just models. Nightshade itself is not meant as an end-all, extremely powerful weapon to kill these companies,” Zhao said. “Nightshade shows that these models are vulnerable and there are ways to attack. What it means is that there are ways for content owners to provide harder returns than writing Congress or complaining via email or social media.” 

Zhao and his team aren’t trying to take down Big AI — they’re just trying to force tech giants to pay for licensed work, instead of training AI models on scraped images. 

“There is a right way of doing this,” he continued. “The real issue here is about consent, is about compensation. We are just giving content creators a way to push back against unauthorized training.” 

Left: The Mona Lisa, unaltered. Middle: The Mona Lisa, after Nightshade. Right: How AI “sees” the shaded version of the Mona Lisa. Image Credits: Courtesy of University of Chicago researchers

Nightshade targets the associations between text prompts and images, subtly changing the pixels in images to trick AI models into interpreting a completely different image than what a human viewer sees. Models will incorrectly categorize features of “shaded” images, and if they’re trained on a sufficient amount of “poisoned” data, they’ll start to generate images completely unrelated to the corresponding prompts. It can take fewer than 100 “poisoned” samples to corrupt a Stable Diffusion prompt, the researchers write in a technical paper currently under peer review.
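The attack described in the paper involves a guided optimization, but the core idea, a pixel perturbation small enough to be invisible yet aimed at the model’s feature space, can be sketched roughly. This is a toy illustration, not Nightshade’s actual method: the gradient toward the target concept is assumed to be precomputed, and the epsilon budget is illustrative.

```python
import numpy as np

def shade(image: np.ndarray, target_grad: np.ndarray, epsilon: float = 4.0) -> np.ndarray:
    """Illustrative sketch only: nudge each pixel toward a target concept
    while keeping every change within +/- epsilon, so the edit stays
    nearly imperceptible to a human viewer."""
    step = epsilon * np.sign(target_grad)            # direction that shifts model features
    poisoned = np.clip(image.astype(float) + step, 0, 255)
    return poisoned.astype(np.uint8)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)   # stand-in image
grad = rng.standard_normal((8, 8, 3))                        # assumed precomputed gradient
shaded = shade(img, grad)
# No pixel moves by more than epsilon, so the two images look alike to us.
```

The bound on the perturbation is what makes the “shading” hard to see; the direction of the perturbation is what misleads the model.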

Take, for example, a painting of a cow lounging in a meadow.

“By manipulating and effectively distorting that association, you can make the models think that cows have four round wheels and a bumper and a trunk,” Zhao told TechCrunch. “And when they are prompted to produce a cow, they will produce a large Ford truck instead of a cow.”

The Nightshade team provided other examples, too. An unaltered image of the Mona Lisa and a shaded version are virtually identical to humans, but instead of interpreting the “poisoned” sample as a portrait of a woman, AI will “see” it as a cat wearing a robe. 

Prompting an AI to generate an image of a dog, after the model was trained using shaded images that made it see cats, yields horrifying hybrids that bear no resemblance to either animal. 

AI-generated hybrid animals
It takes fewer than 100 poisoned images to start corrupting prompts. Image Credits: Courtesy of University of Chicago researchers

The effects bleed through to related concepts, the technical paper noted. Shaded samples that corrupted the prompt “fantasy art” also affected prompts for “dragon” and “Michael Whelan,” who is an illustrator specializing in fantasy and sci-fi cover art. 

Zhao also led the team that created Glaze, a cloaking tool that distorts how AI models “see” and determine artistic style, preventing them from imitating artists’ unique work. As with Nightshade, a human viewer sees a “glazed” realistic charcoal portrait as intended, but an AI model interprets it as an abstract painting, and it will generate messy abstract paintings when prompted for fine charcoal portraits.

Speaking to TechCrunch after the tool launched last year, Zhao described Glaze as a technical attack being used as a defense. While Nightshade isn’t an “outright attack,” Zhao told TechCrunch more recently, it’s still on the offensive against predatory AI companies that disregard opt-outs. OpenAI — one of the companies facing a class action lawsuit for allegedly violating copyright law — now allows artists to opt out of having their work used to train future models.

“The problem with this [opt-out requests] is that it is the softest, squishiest type of request possible. There’s no enforcement, there’s no holding any company to their word,” Zhao said. “There are plenty of companies who are flying below the radar, that are much smaller than OpenAI, and they have no boundaries. They have absolutely no reason to abide by those opt out lists, and they can still take your content and do whatever they wish.” 

Kelly McKernan, an artist who’s part of the class action lawsuit against Stability AI, Midjourney and DeviantArt, posted an example of their shaded and glazed painting on X. The painting depicts a woman tangled in neon veins, as pixelated lookalikes feed off of her. It represents generative AI “cannibalizing the authentic voice of human creatives,” McKernan wrote.

McKernan began scrolling past images with striking similarities to their own paintings in 2022, as AI image generators launched to the public. When they found that over 50 of their pieces had been scraped and used to train AI models, they lost all interest in creating more art, they told TechCrunch. They even found their signature in AI-generated content. Using Nightshade, they said, is a protective measure until adequate regulation exists. 

“It’s like there’s a bad storm outside, and I still have to go to work, so I’m going to protect myself and use a clear umbrella to see where I’m going,” McKernan said. “It’s not convenient and I’m not going to stop the storm, but it’s going to help me get through to whatever the other side looks like. And it sends a message to these companies that just take and take and take, with no repercussions whatsoever, that we will fight back.” 

Most of the alterations that Nightshade makes should be invisible to the human eye, but the team does note that the “shading” is more visible on images with flat colors and smooth backgrounds. The tool, which is free to download, is also available in a low-intensity setting to preserve visual quality. McKernan said that although they could tell that their image was altered after using Glaze and Nightshade, because they’re the artist who painted it, it’s “almost imperceptible.” 

Illustrator Christopher Bretz demonstrated Nightshade’s effect on one of his pieces, posting the results on X. Running an image through Nightshade’s lowest and default setting had little impact on the illustration, but changes were obvious at higher settings.

“I have been experimenting with Nightshade all week, and I plan to run any new work and much of my older online portfolio through it,” Bretz told TechCrunch. “I know a number of digital artists that have refrained from putting new art up for some time and I hope this tool will give them the confidence to start sharing again.”

Ideally, artists should use both Glaze and Nightshade before sharing their work online, the team wrote in a blog post. The team is still testing how Glaze and Nightshade interact on the same image, and plans to release an integrated, single tool that does both. In the meantime, they recommend using Nightshade first, and then Glaze to minimize visible effects. The team urges against posting artwork that has only been shaded, not glazed, as Nightshade doesn’t protect artists from mimicry. 

Signatures and watermarks — even those added to an image’s metadata — are “brittle” and can be removed if the image is altered. The changes that Nightshade makes will remain through cropping, compressing, screenshotting or editing, because they modify the pixels that make up an image. Even a photo of a screen displaying a shaded image will be disruptive to model training, Zhao said. 

As generative models become more sophisticated, artists face mounting pressure to protect their work and fight scraping. Steg.AI and Imatag help creators establish ownership of their images by applying watermarks that are imperceptible to the human eye, though neither promises to protect users from unscrupulous scraping. The “No AI” Watermark Generator, released last year, applies watermarks that label human-made work as AI-generated, in hopes that datasets used to train future models will filter out AI-generated images. There’s also Kudurru, a tool from Spawning.ai, which identifies and tracks scrapers’ IP addresses. Website owners can block the flagged IP addresses, or choose to send a different image back, like a middle finger.

Kin.art, another tool that launched this week, takes a different approach. Unlike Nightshade and other programs that adversarially modify an image’s pixels, Kin masks parts of the image and swaps its meta tags, making it more difficult to use in model training.

Nightshade’s critics claim that the program is a “virus,” or complain that using it will “hurt the open source community.” In a screenshot posted on Reddit in the months before Nightshade’s release, a Discord user accused Nightshade of “cyber warfare/terrorism.” Another Reddit user who inadvertently went viral on X questioned Nightshade’s legality, comparing it to “hacking a vulnerable computer system to disrupt its operation.”

Zhao dismissed the claim that Nightshade is illegal because it “intentionally disrupt[s] the intended purpose” of a generative AI model, asserting that the tool is perfectly legal. It’s not “magically hopping into model training pipelines and then killing everyone,” Zhao said. The model trainers are voluntarily scraping images, both shaded and not, and AI companies are profiting off of it.

The ultimate goal of Glaze and Nightshade is to incur an “incremental price” on each piece of data scraped without permission, until training models on unlicensed data is no longer tenable. Ideally, companies will have to license uncorrupted images to train their models, ensuring that artists give consent and are compensated for their work. 

It’s been done before; Getty Images and Nvidia recently launched a generative AI tool entirely trained using Getty’s extensive library of stock photos. Subscribing customers pay a fee determined by how many photos they want to generate, and photographers whose work was used to train the model receive a portion of the subscription revenue. Payouts are determined by how much of the photographer’s content was contributed to the training set, and the “performance of that content over time,” Wired reported. 

Zhao clarified that he isn’t anti-AI, and pointed out that AI has immensely useful applications that aren’t so ethically fraught. In the world of academia and scientific research, advancements in AI are cause for celebration. While most of the marketing hype and panic around AI really refers to generative AI, traditional AI has been used to develop new medications and combat climate change, he said. 

“None of these things require generative AI. None of these things require pretty pictures, or make up facts, or have a user interface between you and the AI,” Zhao said. “It’s not a core part for most fundamental AI technologies. But it is the case that these things interface so easily with people. Big Tech has really grabbed on to this as an easy way to make profit and engage a much wider portion of the population, as compared to a more scientific AI that actually has fundamental, breakthrough capabilities and amazing applications.”

The major players in tech, whose funding and resources dwarf those of academia, are largely pro-AI. They have no incentive to fund projects that are disruptive and yield no financial gain. Zhao is staunchly opposed to monetizing Glaze and Nightshade, or ever selling the projects’ IP to a startup or corporation. Artists like McKernan are grateful to have a reprieve from subscription fees, which are nearly ubiquitous across software used in creative industries.

“Artists, myself included, are feeling just exploited at every turn,” McKernan said. “So when something is given to us freely as a resource, I know we’re appreciative.”

The team behind Nightshade, which consists of Zhao, PhD student Shawn Shan, and several grad students, has been funded by the university, traditional foundations and government grants. But to sustain research, Zhao acknowledged that the team will likely have to figure out a “nonprofit structure” and work with arts foundations. He added that the team still has a “few more tricks” up their sleeves. 

“For a long time research was done for the sake of research, expanding human knowledge. But I think something like this, there is an ethical line,” Zhao said. “The research for this matters . . . those who are most vulnerable to this, they tend to be the most creative, and they tend to have the least support in terms of resources. It’s not a fair fight. That’s why we’re doing what we can to help balance the battlefield.” 

HopSkipDrive says personal data of 155,000 drivers stolen in data breach

High-angle view of yellow padlocks on yellow background. One of them is open.

Image Credits: Javier Zayas Photography / Getty Images

Student rideshare startup HopSkipDrive has confirmed a data breach involving the personal data of more than 155,000 drivers.

Los Angeles-based HopSkipDrive offers an Uber-style rideshare service for children and teenagers. The startup, which has raised at least $90 million since it was founded in 2014, partners with school districts to transport students who live outside traditional bus routes or need extra help getting to school.

In a filing with Maine’s attorney general last week, HopSkipDrive confirmed that it had experienced a cybersecurity incident in June that resulted in a data breach affecting 155,394 drivers. HopSkipDrive said the stolen data included names, email and postal addresses, driver’s license numbers and non-driver identification card numbers.

HopSkipDrive spokesperson Campbell Millum told TechCrunch that those affected include “people who drive on our platform or who applied to drive on our platform.” Millum added that no employee or customer data was accessed in the breach.

The company told TechCrunch that it first detected the breach on June 12, 2023, when it “discovered suspicious activity on certain third-party applications utilized by our organization.” The company declined to name the compromised applications.

In a letter sent to those affected, HopSkipDrive said it first became aware of the issue after receiving an email from an unknown threat actor.

When TechCrunch asked why it took the company months to notify affected drivers, HopSkipDrive’s spokesperson rebuffed claims of a delay in the company’s communications, adding that the company first notified affected individuals in the first week of July and has “continued communications since then.”

“We promptly launched an investigation, engaged experts to assist in assessing the scope of the incident, and took steps to mitigate the potential impact to our community,” the letter sent to affected drivers reads. “A third-party forensic investigation determined the incident occurred between May 31, 2023 and June 10, 2023.”

HopSkipDrive said it is “committed to strengthening our systems’ security to prevent a similar event from occurring again in the future,” but did not elaborate on what additional safeguards it is implementing.

TechCrunch asked HopSkipDrive, whose leadership page does not list a chief security officer, if it has a company executive dedicated to handling cybersecurity at the company. HopSkipDrive said it has “information security experts on both our legal and our technology teams.”

Social networks are getting stingy with their data, leaving third-party developers in the lurch

Steve Huffman, CEO of Reddit attends Variety & Reddit An Evening With Future Makers at Wynn Las Vegas on January 05, 2023 in Las Vegas, Nevada.

Image Credits: Greg Doherty / Variety / Getty Images

2023 was the year social networks realized that they were sitting on massive troves of data. And some companies, such as Twitter (now X) and Reddit, decided to change their terms to shut out third-party experiences on these platforms. In the process, they also put a price on their data — something they believe is highly valuable today as it can be used to train AI models.

After talking to several third-party developers who built apps and services on top of these larger social networks, we learned that there are mixed feelings among the developers about building experiences around social networks. While they are excited about the rise of decentralized networks, some of them haven’t seen enough incentives to build out new apps.

The APIcalypse

Twitter has historically been one of the most popular social networks when it comes to third-party apps. Some of those apps even pre-date the name “tweet.” In January, however, the company began cutting off these clients without any notice. Within days, in typical Elon Musk style, the social network quietly changed its developer terms to shut down third-party apps outright.

Twitterrific was one of the earliest Twitter clients. Image Credits: Twitterrific

In the following months, the company shuttered free access to its API and raised prices for the remaining tiers. The new prices and rate limits were steep enough that third-party experiences beyond clients, such as automated alert-posting services and academic research, suffered as well. Only paid applications built for teams and enterprises could plausibly bear costs as high as $42,000 per month for enterprise-tier access.

In contrast, Reddit gave developers notice about its API term changes in April. In May, however, Christian Selig, the developer of the popular iOS client Apollo, had a call with the company in which he learned that the fees Reddit demanded were so high his app would be forced out of business. Following that news, many subreddits staged blackouts in protest. Reddit CEO Steve Huffman gave multiple interviews defending the changes, and Reddit began forcing out rebel moderators by removing them from their moderator positions.

Image Credits: Apollo app image by TechCrunch

As Reddit implemented its API changes, many apps, including Apollo, RIF (Reddit is Fun), ReddPlanet, Sync and BaconReader shut down. Some apps like Narwhal remained available with a subscription model, while RedReader, Dystopia and Luna were exempted from API changes because they provided accessibility features without any commercialization.

Narwhal developer Rick Harrison, who is also the co-founder and CTO of YC-backed dispensary software Meadow, told TechCrunch that he plans to maintain his Reddit client.

“I have continued developing this app because I primarily make this app for myself. I originally made Narwhal before Reddit even had an app and I wanted one for my daily commute. Now, I continue working on it as a side project, because I find it fun, and I have a small community of users who also enjoy the app,” he said.

On the other hand, Selig, who had to stop the development of the more popular Reddit client Apollo, isn’t bullish on continued collaboration between social media companies and developers. He calls that era “a bit of a distant memory.”

“The challenges and the severity of them depend on the host platform and their willingness to have developers on their platform, within those constraints, things can go either very smoothly or with great difficulty,” Selig said.

“For instance with Reddit, at the beginning, they were very communicative with developers and the development of the API, however in recent years they instead chose to use a different API internally and no longer use the ‘traditional’ API that developers depended on, which meant it was effectively left to languish with bugs and missing features.”

Industry observers noted that social networks have gradually closed down their platforms for third-party apps. Tom Coates, the co-founder of an older decentralized app called Planetary, told TechCrunch that social apps were open to allow third-party experiences in the early days of the social web, but over time, platforms like Instagram have proved that closed ecosystems are beneficial.

“Other platforms have been a bit more reliable, but the early spirit of trying to grow the whole ecosystem and to become a good citizen in a wider environment has kind of gone from the corporate social world,” he said.

Social networks want to protect their data from AI

Both Reddit and Twitter put restrictions on their APIs to better monetize their social networks’ data. Apart from introducing new API limits, in July 2023 Musk also capped the number of posts users can view in a day as a measure to “address extreme levels of data scraping.”

Last September, the company updated its policy to ban scraping and crawling. The updated policy also allowed X to train AI models on public data. In November, Musk’s new AI company, xAI, launched Grok, an AI chatbot built to leverage that data. In December, subscribers to X’s highest-paid tier, Premium+, got access to the chatbot.

Reddit also put gates around its data to stop AI companies from using it to train models. In an interview with The New York Times in April, CEO Steve Huffman said that companies need to pay to use Reddit’s data, which is valuable not only to AI companies like OpenAI but also to Google Search, where users in recent years have been appending “reddit” to their queries to get less spammy results.

“The Reddit corpus of data is really valuable,” Huffman told The NYT. “But we don’t need to give all of that value to some of the largest companies in the world for free,” he said.

Reddit CEO Steve Huffman delivers remarks on “Redesigning Reddit” during the third day of Web Summit in Altice Arena on November 8, 2017 in Lisbon, Portugal. Image Credits: Horacio Villalobos – Corbis / Contributor / Getty Images

Weeks later, with Reddit at a crossroads with third-party developers, Huffman told The Verge that the social network’s API was never meant to support third-party clients.

“[Reddit’s API] was never designed to support third-party apps. We let it exist. And I should take the blame for that because I was the guy arguing for that for a long time. But I didn’t know — and this is my fault — the extent that they were profiting off of our API. That these were not charities,” he noted.

The new wave of social networks

The changing of the guard at X/Twitter gave birth to new social networks like Bluesky, Spill, Pebble (now discontinued), Post and Meta-owned Instagram Threads. While the decentralized network Mastodon has been around since 2016, it too gained new prominence in the social sphere.

After Twitter booted longtime third-party developers, many of them began building Mastodon clients: Tapbots (which built Tweetbot) introduced a Mastodon app called Ivory; Shihab Mehboob (who developed Aviary) built Mammoth, which is now under different management; and Matteo Villa (who built Fenix) is focusing on Woolly.

Image Credits: Mammoth

Mehboob is bullish about Mastodon’s future and told TechCrunch that the open ecosystem helps developers make different experiences.

“Mastodon is my new favorite when it comes to third-party app support as it’s so open. There are no real restrictions apart from rate limiting and pagination — which exist on all social media apps to prevent abuse of the API. Their documentation is also great,” he said.

Bluesky, another Twitter competitor that has just launched publicly, found some promising traction with third-party developers. Despite the closed ecosystem, developers have started making mobile clients like Graysky and multi-column desktop apps such as Skydeck and Blueskydeck.

Bluesky CEO Jay Graber speaks at the Knight Foundation’s Informed conference. Image Credits: Marco Bello and Eva Marie for the Knight Foundation

Samuel Newman, the developer of Graysky, said that platforms like Bluesky and Mastodon provide third-party devs the same tools and APIs they use to build their official apps. However, as companies race to achieve feature parity with existing networks while also trying to remain unique and differentiated from one another, the pace of change is rapid.

“Bluesky and the AT Protocol [the decentralized protocol used by Bluesky] are very much in active development, meaning things are moving fast and you have to keep up with any changes that might come up,” Newman said. “However, it’s built around a very strict schema, which means I can generally trust what data will be received.”

Image Credits: Graysky
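Newman’s point about a strict schema is that a client can validate incoming records against a fixed shape before trusting them. A minimal sketch of that idea in Python, where the schema fields are illustrative rather than the AT Protocol’s real lexicon definitions:

```python
# Toy schema check in the spirit of AT Protocol lexicons: a record either
# matches the declared shape, or the client rejects it up front.

POST_SCHEMA = {"text": str, "createdAt": str, "langs": list}

def validate(record: dict, schema: dict) -> bool:
    """True only if every schema field is present with the expected type."""
    return all(
        key in record and isinstance(record[key], expected)
        for key, expected in schema.items()
    )

good = {"text": "hello", "createdAt": "2024-02-08T00:00:00Z", "langs": ["en"]}
bad = {"text": "hello", "createdAt": 12345}   # wrong type, missing field
```

A client that validates up front can evolve alongside a fast-moving API without silently mis-rendering malformed data.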

Mastodon and Bluesky seem to have embraced third-party clients for now, but the growth of these social networks will be critical for developers. Third-party clients need a significant audience of potential users to earn a sustainable income, yet Mastodon currently has just 1.5 million monthly active users across instances, while Bluesky has nearly 4 million users after launching publicly this week. In November, Bluesky said that federation is coming in 2024, meaning others will be able to run their own servers on the AT Protocol, with varied features and moderation rules.

Weaving the Thread(s)

The largest player in this new wave of social networks is Meta’s Threads, which has more than 130 million active users after opening up to EU-based users late last year. The app has started experimenting with ActivityPub integration so users on Mastodon and other compatible social networks can view Threads posts. What’s more, Instagram has started work on a Threads API which will be limited to posting at first.

Isaac Shea, who is working on Threaditor, a draft post saver for Threads, said his biggest worry was platforms copying features from third-party apps rather than acquiring or working with them.

“My biggest worry is not API maintenance — although keeping up with changing APIs can be a pain — but more being Sherlocked by the parent platform, intentionally or not,” he said, using a term for when Apple bakes features similar to an existing app into its own system. “In recent years I have felt a trend away from acquiring or working with third-party developers, in favor of replicating and drowning them,” he added.

The Threads logo on a laptop. Photographer: Gabby Jones/Bloomberg via Getty Images

Though Meta has shown a willingness to work on a Threads API, it’s not clear if it will allow alternative Threads clients to flourish — especially when the company has a track record of shutting down third-party apps made for its social networks. Third-party clients usually don’t feature ads, which is Meta’s main revenue source.

Coates, who is an active advocate of decentralized systems, said that if the Mastodon ecosystem (or any decentralized ecosystem) grows, we will see more third-party services thrive such as moderation companies, scalable hosting companies, CSAM scanning services, identity providers and wallet integrations.

As he wrote in detail on his blog, Threads starting with ActivityPub integration doesn’t automatically mean we’ll see a flourishing ecosystem of apps by default.

“Again though, the integration of Threads into this ecosystem doesn’t necessarily equate to a larger market — as Meta may not need to use many of the back-end services, and most likely will not initially allow their users to use alternative clients,” Coates said.

“On the other hand, Meta’s integration may cause a lot more interest in self-hosting and other companies and communities may join the larger Mastodon ecology and that would increase the opportunities and possibility for services and products of this kind,” he noted.

Money-making through apps

Developers like Selig, who have been burnt by social networks, don’t want to rely too much on social platforms to make money going forward.

“Unless you have a contract that is willing to lock in the terms for at least a brief period of time — Reddit was unwilling to do this, for instance — I wouldn’t rely on them whatsoever. As we’ve seen, platform stewards can change their mind at the drop of a hat with next to no notice to developers,” Selig said. “However, if you’re able to establish a good relationship with a platform — one put in writing, not just promised — it can be a great opportunity that is a lot of fun to work on.”

Shea and Newman’s apps are early into their journeys. While Threaditor doesn’t have a paid model yet, Graysky has just started offering a premium subscription.

Threaditor helps you write long threads and save them. Image Credits: Threaditor

Bart Decrem, who acquired Mammoth and continues to develop it, admits that platforms often disappoint developers. But he’s hopeful because of the open nature of Mastodon.

“You’re far better off as a third-party developer if you’re building on an open protocol and/or open source project. ActivityPub and Mastodon are a perfect example. But of course, one famous example is email, with a thriving developer ecosystem,” he said.

The Iconfactory’s principal developer Ged Maheux told TechCrunch that the company has learned to diversify its revenue across different apps after the bad experience of Twitterrific’s shutdown.

“We learned this lesson the hard way, of course, pinning so much of our revenue on an app that hinged on API access controlled by a third party. Since then we’ve been working hard to spread our eggs across multiple baskets, which is good advice for anyone developing for social I think,” Maheux said.

Image Credits: The Iconfactory

Earlier this week, Maheux and The Iconfactory launched a Kickstarter campaign for a new app called Tapestry, which will let you connect your social media accounts and RSS feeds in a single chronological timeline. Naturally, there is no support for X in the app.

Legacy social networks closing their doors to developers has given way to a new breed of clients, largely focused on decentralized platforms. Social platforms like Mastodon and Bluesky have promised to support the development of alternative experiences. For users, it’s hard to keep up with the new social networks springing up so frequently. Their best bet is that developers building bridges between these networks, through cross-posting or unified feeds, will be able to do so without platforms shutting them down.

If you work at any of the social networks, or are a developer trying to build something around social media platforms, I’d love to hear from you at [email protected].

Drawing of a cloud on a blue background with arrows going in and out of the cloud to show the syncing concept.

Artie helps companies put data to work faster with real-time syncing

Image Credits: Khanchit Khirisutchalual / Getty Images

One of the big problems companies face as they try to use data to solve business problems has been the lag between when data comes in and when it’s actually useful to applications: in other words, the time it takes to move it from a database like Postgres into a data warehouse like Snowflake.

It’s a problem that Artie, an early-stage startup from a husband and wife founding team, wants to solve. And today the Summer 2023 Y Combinator grads announced a $3.3 million seed investment.

CEO Jacqueline Cheong launched the company with her husband, CTO Robin Tang, in 2023. “Under the hood, we use Change Data Capture (CDC) and stream processing — so we use Kafka (open source data streaming software) — and that way we can perform data syncs in a super reliable, non-intrusive and hyper efficient way,” Cheong told TechCrunch.

That approach means there is very low latency and that enables customers to put their data to work much faster. She says this also helps optimize compute costs because they essentially only work with the data that has changed.

The couple decided to launch a company together to solve a problem around getting data from the database to the data storage repository with a minimum of fuss. Most companies have to wrangle the data before they use it, performing ETL (extract, transform, load) to get the data into a form where you can use it — and all that takes valuable time.

The typical trade-off for data wrangling has been between speed and cost. If you wanted it cheaper, it took more time. If you wanted it faster, it cost more. Cheong and Tang believe that they have found a happy medium where they can process it faster and cheaper. Their method allows customers to process smaller amounts of data instead of ingesting bulk data regularly throughout the day, thereby increasing speed while lowering the cost.
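The article doesn’t detail Artie’s internals, but the core idea of CDC-based syncing can be sketched in a few lines: instead of re-copying whole tables, the pipeline consumes a stream of row-level change events and applies only the deltas to the destination. Here is a minimal, self-contained Python sketch; the event shape and function name are hypothetical, and in a real deployment the events would arrive via Kafka from a source database like Postgres.

```python
# Minimal sketch of Change Data Capture (CDC) replication: apply a stream
# of row-level change events to a destination table instead of bulk-copying.
# Event shape is illustrative; real pipelines read such events from Kafka.

def apply_cdc_events(destination, events):
    """Apply insert/update/delete events keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: only the changed row is touched, not the whole table.
            destination[key] = event["row"]
        elif op == "delete":
            destination.pop(key, None)
    return destination

# Destination "warehouse" table, keyed by primary key.
warehouse = {1: {"id": 1, "name": "Ada"}}

events = [
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "Grace"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "Ada L."}},
    {"op": "delete", "key": 2},
]

apply_cdc_events(warehouse, events)
print(warehouse)  # {1: {'id': 1, 'name': 'Ada L.'}}
```

The cost advantage follows directly: the work done is proportional to the number of changed rows, not to the size of the table being replicated.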

They launched the company at the beginning of last year and had an initial cloud version of the product by April. They started building a group of open source components, which will remain open source under the Elastic License v2. “So essentially, what that means is that any company can use [the open source pieces] internally, and we’re not going to change that, but the only thing that’s restricted is basically you can’t use that to sell and commercialize our product.”

Today, the company has 10 paying enterprise customers and four employees, including the two founders. They work from a small office in San Francisco, as Cheong believes it facilitates collaboration at an early-stage company and creates separation between work and home for them as a couple.

The $3.3 million seed round closed at the end of September, but it is just being announced today. It was led by Exponent Founders Capital with participation from General Catalyst, Y Combinator and several industry angel investors.

Meta's 'consent or pay' data grab in Europe faces new complaints

Image Credits: AMY OSBORNE / Getty Images

Meta’s controversial move last year to charge European Union users for an ad-free subscription to Facebook and/or Instagram, unless they agreed to be tracked and profiled so it could keep running its attention-mining microtargeting ad business, has triggered a set of complaints from consumer rights groups. The complaints are being brought under the bloc’s data protection rules.

Currently, Meta charges regional users €9.99/month on web (or €12.99/month on mobile) to opt out of seeing any adverts per linked Facebook and Instagram account. The only other choice EU users have if they want to access Facebook and Instagram is to agree to its tracking — meaning the offer is to literally pay for privacy, or “pay” for free access by losing your privacy.

Eight consumer rights groups from across the region are filing complaints with national data protection authorities against this “consent or pay” choice, the European consumer organization, BEUC — which is a membership and coordinating body for the groups — announced today.

“It is crucial that any consent provided by consumers is valid and meets the high bar set by the law, which requires such consent to be free, specific, informed and unambiguous. This is not the case with Meta’s ‘pay-or-consent’ model,” they argue in a paper about the complaint, which goes on to suggest Meta is seeking “to coerce consumers into accepting its processing of their personal data.”

“Meta keeps consumers in the dark about its data processing, making it impossible for the consumer to know how the processing changes if they choose one option or the other. The company also fails to show that the fee it imposes on consumers who do not consent is indeed necessary, which is a requirement stipulated by the Court of Justice of the EU,” they also write, adding: “Under these circumstances, the choice about how consumers want their data to be processed becomes meaningless and is therefore not free.”

The eight consumer groups*, located in the Czech Republic, Denmark, Greece, France, Norway, Slovakia, Slovenia and Spain, argue Meta has no valid legal basis for processing people’s data for ad targeting under the bloc’s General Data Protection Regulation (GDPR) — asserting the company is processing personal data in a way that is “fundamentally incompatible with European data protection law.”

Specifically, they’re accusing Meta of violating the GDPR principles of purpose limitation, data minimization, fair processing and transparency.

Penalties for confirmed breaches of the regulation can reach up to 4% of global annual turnover. More importantly, companies can be ordered to stop unlawful processing — with the potential for regulators to reform privacy-hostile business models.

Commenting in a statement, Ursula Pachl, deputy director general of BEUC, said:

Meta has tried time and time again to justify the massive commercial surveillance it places its users under. Its unfair “pay-or-consent” choice is the company’s latest effort to legalise its business model. But Meta’s offer to consumers is smoke and mirrors to cover up what is, at its core, the same old hoovering up of all kinds of sensitive information about people’s lives which it then monetises through its invasive advertising model. Surveillance-based business models pose all kinds of problems under the GDPR and it’s time for data protection authorities to stop Meta’s unfair data processing and its infringing of people’s fundamental rights.

BEUC said a legal analysis it undertook with members and the data rights law firm, AWO, concluded that Meta’s processing of consumers’ personal data breaches the GDPR in multiple ways. As well as lacking a valid basis, some of the processing for ads “appears to rely invalidly on contract,” the analysis suggests.

The analysis also queries what legal basis Meta relies upon for content personalization — finding this is “not clear” and “there is no way to verify” all of Meta’s profiling for this purpose is both necessary for the relevant contract and consistent with the GDPR principle of data minimization. The same questions are attached to Meta’s profiling for advertising purposes.

It also found Meta’s processing in general is not consistent with the principles of transparency and purpose limitation — highlighting a lack of transparency, unexpected processing, use of a dominant position to force consent, and “switching of legal bases in ways which frustrate the exercise of data subject rights,” which it also said is not consistent with the GDPR principle of fairness.

As we’ve reported before, Meta’s self-serving “consent or cough up” offer is already facing a number of other GDPR complaints, including one brought by privacy rights group noyb that’s focused on the premium price Meta has put on privacy; another is focused on the asymmetry in the choice Meta has devised, which makes it super simple for users to agree to its tracking but a lot more arduous to protect their privacy, including if they wish to change their mind and withdraw previously given consent.

Earlier this month three DPAs also requested that the EU’s regulatory body for data protection, the EDPB, issue an opinion on the legality of consent or pay.

That guidance is still pending. But fresh complaints — and this pincer action by consumer protection and privacy rights groups — could pile pressure on the EU’s data protection regulator not to rubber stamp a tactic privacy campaigners have long warned is a cynical attempt to circumvent the bloc’s data protection rulebook for commercial gain.

Meta has already lost the ability to use other legal bases it had claimed authorized its ads’ processing — following earlier privacy complaints (and a competition challenge). This means obtaining users’ consent is, basically, the last chance for it to continue operating its tracking ads business in the EU, where the law requires a valid legal basis for processing people’s data (the GDPR names six legal bases but the rest aren’t relevant for an adtech business like Meta’s).

If Meta’s latest consent coercion fails, it could — finally — be forced to reform its surveillance business model. As we’ve written before, the stakes are high: for Meta and for web users in Europe.

Today’s complaints are not the first filed against Meta’s consent or pay tactic by consumer protection groups — some of which argue it’s breaching the bloc’s rules on consumer protection, too. Broader, coordinated action from the sector last November saw BEUC and 18 of its member groups filing complaints against what they dubbed “unfair, deceptive and aggressive practices” by Meta that they assert breach the bloc’s consumer protection rules.

European consumer groups band together to fight Meta’s self-serving ad-free sub — branding it ‘unfair’ and ‘illegal’

Those complaints were filed with the CPC, a regional network of consumer protection authorities. If Meta does not engage with the CPC’s process, such as by offering concessions aimed at remedying the groups’ complaints, it could face enforcement action by consumer regulators (which are empowered to issue fines of up to 4% of global turnover).

At the time, the BEUC said it may also look to bring a data protection complaint against Meta’s controversial consent offer — which is the development we’re seeing today.

“Meta must stop any illegal processing of consumers’ personal data, including for the purpose of advertising,” it wrote in a press release. “Any illegally collected personal data must be deleted. In addition, if Meta would like to use consumers’ consent as legal basis for its data processing, it must ensure that this consent is indeed freely given, specific, informed and unambiguous, as required by the law.”

Meta has previously argued its consent or pay offer is lawful under the GDPR. However, its blog post defending the controversial tactic does not make any mention of how it complies with EU consumer protection law.

There’s a further consideration here too: The European Commission oversees enforcement of Meta’s compliance with the Digital Services Act’s (DSA) rules for larger platforms and the Digital Markets Act (DMA) — two newer, pan-EU regulations that stipulate consent has to be obtained for processing personal data for ad-targeting purposes. These regulations also ban the use of sensitive personal data or minors’ data for ads and state that consent must be as easy to withdraw as it is to provide. So another very pertinent question, vis-à-vis Meta’s consent or pay offer in the EU, is what the Commission will do.

The EU’s executive is empowered to enforce the DSA and DMA on Meta — which could include issuing corrective orders. Breaches of the DSA can also lead to penalties of up to 6% of annual turnover, while the DMA can see fines as high as 10% (or even higher for repeat offenses).

The latest consumer group GDPR complaints against Meta will likely have to wend their way to the tech giant’s lead data supervisor in the EU, Ireland’s Data Protection Commission, which continues to face criticism over how weakly it enforces the GDPR against Meta and other tech giants. But there are a number of other avenues where the company’s consent choice is facing scrutiny, and, potentially, faster and firmer enforcement action too.

*The BEUC members filing GDPR complaints against Meta are CECU, dTest, EKPIZO, Forbrugerrådet Tænk, Forbrukerrådet, Spoločnosť ochrany spotrebiteľov (S.O.S.) Poprad, UFC-Que Choisir and Zveza Potrošnikov Slovenije (ZPS). A ninth consumer group, the Netherlands-based Consumentenbond, is not filing a complaint but will be sending a letter to the Dutch data protection authority, per BEUC.

European digital rights groups say the future of online privacy is on a knife edge

Meta faces another EU privacy challenge over ‘pay for privacy’ consent choice

Meta’s EU ad-free subscription faces early privacy challenge

Satellites in earth orbit over blue tinted earth.

New geospatial data startup streamlines satellite imagery visualization

Image Credits: Yuichiro Chino / Getty Images

These days we have a ton of geospatial data coming off the growing number of satellites orbiting the Earth, but it takes serious processing power and engineering prowess to turn that data into something useful.

Two former Uber engineers, Sina Kashuk and Isaac Brodsky, who helped build Uber’s mapping system, decided to put their knowledge to work on the problem. What they’ve come up with is a fast, serverless product, built on open source tooling of their own, that gets that data from the source into apps where it can be put to work.

Today, the startup emerged from stealth with the Fused platform, and $1 million in pre-seed funding.

Fused is a three-part system. It takes satellite data sitting in storage repositories and runs it through its platform to make it usable. The end goal is to create visual representations of things like weather, deforestation or crop data, and to put that data to work in applications people already use, like Excel, Airtable or Notion. What separates it from what came before is simplicity and processing speed.

Kashuk, who is co-founder and CEO at Fused, says the company took some time to create the current solution. In fact, after leaving Uber, they formed another company in 2019 called Unfolded.ai that was looking at a similar data visualization problem, but was acquired very quickly by Foursquare.

“The main challenge that we had with Unfolded was it was very hard to get to market because the majority of the platform was open source and people were like, ‘why do I have to pay you guys, when I can just use the open source’,” Kashuk told TechCrunch.

That was a business model problem, but they also ran into limitations because they were trying to run a front-end solution locally on a laptop, and as the data scaled it got very slow. As they were thinking about their next business, Kashuk and Brodsky saw this opportunity with data coming off of the growing number of commercial satellites, and they realized that the data processing on the back end remained a challenge, and could be a business.

Specifically, the founders saw the maturation of serverless computing, where the vendor can handle all of the back-end infrastructure, as an opportunity to help customers process this data and put it to use faster and more efficiently. “So the movement of serverless combined with massive amounts of data coming off of the growing commercial satellite industry made us believe that the industry was ready for this kind of shift,” Kashuk said.

The platform is essentially a middleware processing layer that helps turn the geospatial data into something more consumable. It consists of several open source pieces and a serverless processing engine. It is the latter where the company makes money. Each time data hits the API gateway to the serverless back end, Fused gets paid.

Fused Workbench showing data visualization next to data.
Fused Workbench in action. Image Credits: Fused

The entire system is built with Python, which is widely used by data scientists and developers. The system starts with a kind of visualization template called a user-defined function, or UDF. This part is open source, enabling the community to create templates for things like crop yields; using other pieces of the Fused platform, users can then break the data down, such as the amount of wheat produced by U.S. state or any other geographic area.

They can then transfer this data to other programs for further analysis or create data visualizations based on the data. One of the big differentiators here is the speed at which Fused creates these visualizations or processes changes to them. Instead of taking hours, it takes seconds, according to the company.
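Fused’s actual UDF API isn’t shown in the article, so the following is a hypothetical Python sketch of the pattern it describes: a small user-defined function that reduces raw crop records to a per-region metric, such as wheat tonnage by U.S. state. The function name, record shape and numbers are all illustrative, not Fused’s real interface.

```python
# Hypothetical sketch of the UDF pattern described in the article: a small
# Python function that aggregates geospatial records into a per-region
# metric. Fused's real UDFs run on its serverless backend; this stands alone.

def crop_yield_udf(records, region_field="state", crop="wheat"):
    """Sum production tonnage for one crop, grouped by a region column."""
    totals = {}
    for rec in records:
        if rec["crop"] == crop:
            region = rec[region_field]
            totals[region] = totals.get(region, 0) + rec["tons"]
    return totals

records = [
    {"state": "KS", "crop": "wheat", "tons": 300},
    {"state": "KS", "crop": "corn",  "tons": 120},
    {"state": "ND", "crop": "wheat", "tons": 210},
]

print(crop_yield_udf(records))  # {'KS': 300, 'ND': 210}
```

The resulting per-region table is exactly the kind of compact output that can be handed to a spreadsheet tool or a map renderer for visualization.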

One of the key platform pieces involved in creating the visualizations is the Fused Workbench, which is available starting today. The Workbench, while partially open, only works in conjunction with the back-end API, Kashuk says, but it lets you explore different aspects of the data and see your changes almost instantly.

The company launched last year and has been working with early beta customers before emerging today. The $1 million in pre-seed came from Fontinalis Partners along with various industry angels.

Servers in dark data center room with computers and storage systems

Alphabet spin-off SIP launches Verrus, a data center concept built around battery 'microgrids'

Image Credits: Jasmin Merdan / Getty Images

Sidewalk Infrastructure Partners (SIP) — the Alphabet spinout that focuses on building and backing new approaches to complicated infrastructure problems in areas like power, broadband and waste management — has launched its latest project, a new concept for more flexible data center energy management called Verrus.

Verrus incorporates “microgrids” based on advanced, high-power batteries with software to understand and allocate energy to specific tasks and applications, and it is designed to address some of the power challenges posed by modern computing needs. These include peaks of cloud computing usage and larger projects — such as AI training — that could be bundled into batches distributed across time frames where there is less demand.

Jonathan Winer, the co-founder and co-CEO of SIP, said that the first three data centers designed using Verrus’ architecture will likely be located in Arizona, California and Massachusetts.

The aim is to have these operational in 2026 or 2027. There are as yet no signed customers, although Winer said that a number of “hyperscalers” — one of which, Alphabet, remains a primary backer of SIP after spinning it out several years ago — have shown interest in the project for when it does come online, and would likely be among its target segments as it seeks investment. (Alongside the new business, SIP is also launching the Data Center Flexibility Initiative to bring stakeholders like energy companies, tech giants and regulators together in the meantime.)

Winer described Verrus as having “gigawatt scale ambitions.” A data center to meet that scale, he estimated, could cost $1 billion to put together, with hundreds of millions of dollars of equity needed to get it off the ground. That is only likely to come after building has started and customers begin to sign up, he said. In addition to Alphabet, others that back SIP currently include Ontario Teachers’ and StepStone.

Winer said that SIP has been developing the project in stealth mode for almost two years already, and that it was an offshoot of other research that it had been doing into electricity grid management, coupled with SIP’s work with companies focused on load shifting to better manage energy consumption. Observing the strain that data centers in particular have on the electrical grid, SIP turned its attention to those data centers themselves.

The explosion of cloud computing and AI data computations present “a real challenge on the grid,” he said, and typically data centers are at capacity. “In order to add what we’re going to need both of the AI challenge and just general cloud compute there’s going to have to be a new approach to energy management,” he noted. Simply building more data centers, whether run by third-party data center operators or by the hyperscalers themselves, will not keep up with demand. 

Today, extra power comes from diesel generators and redundant electrical systems within data centers themselves. Verrus’ proposal is instead to use what Winer refers to as a “microgrid” that includes a high-capacity battery, making it more flexible in deploying power to specific areas, or even specific projects, within a data center.

In turn, this means that an AI training job, as one example, could effectively be paused and batched and run at a different time, versus, say, an enterprise cloud service that might require “five nines” demand response.

As SIP sees it, simply adding more data centers — which has been the approach up to now — is not a sustainable approach longer term.

“The challenge with adding data centers to the grid is not the 340 days a year that the grid isn’t maxed out. The grid is very happy to supply power on the days when it’s not at capacity,” he said. “The real challenge is the 20 days a year where for few hours a day they can’t serve the load.” A flexible energy management system would allow for what he described as “islands” in the data center during those hours.
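Verrus hasn’t published its scheduling approach, but the load-shifting idea Winer describes can be illustrated with a toy sketch: deferrable work (say, an AI training batch) is pushed to the hours with the most spare grid capacity, while latency-sensitive workloads run when requested. Everything here (the job shape, numbers and function name) is hypothetical.

```python
# Illustrative load-shifting sketch (not Verrus' actual algorithm): place
# deferrable jobs, such as AI training batches, into the hour with the most
# spare grid capacity, while interactive workloads keep their requested hour.

def schedule(jobs, headroom):
    """headroom maps hour -> spare megawatts. Returns {job name: hour}."""
    placement = {}
    for job in jobs:
        if job["deferrable"]:
            # Pick the hour with the most spare capacity remaining.
            hour = max(headroom, key=headroom.get)
        else:
            hour = job["hour"]  # latency-sensitive: run when requested
        placement[job["name"]] = hour
        headroom[hour] -= job["mw"]  # consume capacity in that hour
    return placement

headroom = {9: 5, 13: 40, 22: 80}  # spare MW at 9:00, 13:00 and 22:00
jobs = [
    {"name": "web-serving", "deferrable": False, "hour": 9, "mw": 3},
    {"name": "ai-training", "deferrable": True, "hour": 9, "mw": 50},
]

print(schedule(jobs, headroom))  # {'web-serving': 9, 'ai-training': 22}
```

The point of the sketch is the asymmetry: the training batch lands at 22:00, the off-peak hour, precisely so the handful of constrained hours a year never see it.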

The approach Verrus proposes underscores how energy usage and consumption continue to evolve in the tech world, and how energy remains a persistent, expensive and resource-intensive issue that needs as much focus and innovation as the software and hardware that rely on it.

Verrus is not the only tech company eyeing how to build and use super-capacity battery architecture to manage electricity distribution. Instagrid, a startup out of Germany, recently raised funding to build batteries that help users with power management in use cases where they are off the grid altogether.

View of main building with logo and signage at the headquarters of professional social networking company LinkedIn

Europe eyes LinkedIn's use of data for ads in another DSA ask

Image Credits: Smith Collection/Gado / Getty Images

Microsoft-owned professional social network, LinkedIn, is the latest to get a formal request for information (RFI) from the EU. The Commission, which oversees larger platforms’ compliance with a subset of risk management, transparency and algorithm accountability rules in its ecommerce rulebook, the Digital Services Act (DSA), is asking questions about LinkedIn’s use of user data for ad targeting.

Of specific concern is whether LinkedIn is breaching the DSA’s prohibition on larger platforms’ use of sensitive data for ad targeting.

Sensitive data under EU law refers to categories of personal data such as health information, political, religious or philosophical views, racial or ethnic origin, sexual orientation and trade union membership. Profiling based on such data to target ads is banned under the law.

The regulation also requires larger platforms (aka VLOPs) to provide users with basic information about the nature and origins of an ad. They must also make an ads archive publicly available and searchable — in a further measure aimed at driving accountability around paid messaging on popular platforms.

In a press release announcing the RFI Thursday, the Commission wrote that it’s asking for “more details on how their service complies with the prohibition of presenting advertisements based on profiling using special categories of personal data”. It also flagged LinkedIn’s requirement to provide users with ad targeting info.

LinkedIn has been given until April 5 to respond to the RFI.

Reached for a response to the Commission’s action, a LinkedIn spokesperson responded by email — stating: “LinkedIn complies with the DSA, including its provisions regarding ad targeting. We look forward to cooperating with the Commission on this matter.”

The RFI represents an early stage in a potential DSA enforcement procedure — suggesting the EU has found issues which are prompting it to ask questions about how LinkedIn adheres to the ban on sensitive data for ads but hasn’t yet established preliminary concerns which would lead it to open a formal investigation. Such a step may follow, though, if it’s not satisfied with the answers it gets.

Compliance is serious business as confirmed violations of the DSA can attract fines of up to 6% of global annual turnover. The DSA also empowers the EU to impose fines for incorrect, incomplete, or misleading information in response to an RFI.

The Commission said its RFI to LinkedIn follows a complaint by civil society organizations, EDRi, Global Witness, Gesellschaft für Freiheitsrechte and Bits of Freedom, back in February — which called for “effective enforcement of the DSA”. 

LinkedIn isn’t the only platform to be in the EU’s spotlight when it comes to use of data for ads. Earlier this month, Meta, the owner of Facebook and Instagram, received an RFI from the Commission asking for more details about how it complies with the DSA’s requirement that use of people’s data for ads needs explicit consent.

A number of other RFIs have also been fired at VLOPs by the EU since the regulation began to apply to them in August last year.

The Commission has said its enforcement is prioritizing action on illegal content/hate speech, child protection, election security and marketplace safety.

Earlier today it announced its first formal investigation of a marketplace, Alibaba’s AliExpress, citing a long list of suspected violations. It also has two open probes of social media sites X and TikTok — raising another string of concerns, such as around illegal content and risk management; and content moderation practices and transparency.

Add to that, today the Commission dialed up scrutiny on how tech giants are responding to risks related to generative AI, such as political deepfakes — sending a bundle of RFIs, including with an eye on the upcoming European Parliament elections in June.

Now the EU is asking questions about Meta’s ‘pay or be tracked’ consent model

EU dials up scrutiny of major platforms over GenAI risks ahead of elections

A United Airlines plane takes off above American Airlines planes on the tarmac at Los Angeles International Airport (LAX) on October 1, 2020

DOT to investigate data security and privacy practices of top US airlines

Image Credits: Mario Tama / Getty Images

The U.S. Department of Transportation announced its first industry-wide review of data security and privacy policies across the largest U.S. airlines.

The DOT said in a press release Thursday that the review will examine whether U.S. airline giants are properly protecting their customers’ personal information and whether airlines are “unfairly or deceptively monetizing or sharing that data with third parties.”

Letters to airline executives will include questions about how the airlines collect and handle passengers’ personal information, how they monetize customer data through targeted advertising, and how employees and contractors are trained to handle passengers’ information.

Those airlines include Allegiant, Alaska, American, Delta, Frontier, Hawaiian, JetBlue, Southwest, Spirit and United.

The department, which oversees U.S. government policy on all matters related to transportation, said it would investigate and take enforcement action as it discovers evidence of problematic practices.

U.S. Secretary for Transportation Pete Buttigieg said the review aims to “ensure airlines are being good stewards of sensitive passenger data.”

The DOT did not say what specifically prompted the review, but said the action was part of the U.S. government’s “broader push to protect consumer privacy across the economy.”

In recent months, the U.S. Federal Trade Commission — which regulates consumer data privacy matters — has banned data brokers and other companies from sharing users’ sensitive location and browsing data with others, ordered companies hit by data breaches to overhaul their security practices and pledged to strengthen the federal law known as COPPA that prevents companies from obtaining data on children under the age of 13.

The DOT said that the FTC is “also exploring rules to more broadly crack down on the harms stemming from surveillance and lax data security.”

Transportation Secretary Buttigieg said the DOT’s privacy review will be carried out with the expertise and partnership of Sen. Ron Wyden, a senior Democrat who sits on the Senate Intelligence Committee.

Wyden has raised alarms about the sharing and sale of sensitive U.S. consumer data to data brokers — companies that collect and resell people’s personal data, like precise location data, often derived from their phones and computers.

In recent months, Wyden has warned that data brokers sell access to Americans’ personal information, which can reveal which websites they visit and the places they travel to. Wyden has also warned that U.S. intelligence agencies can, and have, purchased commercially available information about Americans from data brokers; the intelligence community argues it does not need a search warrant for data it can simply purchase.

In remarks, Wyden said: “Because consumers will often never know that their personal data was misused or sold to shady data brokers, effective privacy regulation cannot depend on consumer complaints to identify corporate abuses.”

Concept photo depicting data infrastructure startups

AI and data infrastructure drives demand for open source startups

Image Credits: Hispanolistic / Getty Images

A new report highlights the demand for startups building open source tools and technologies for the snowballing AI revolution, with the adjacent data infrastructure vertical also heating up.

Runa Capital, a venture capital (VC) firm that left Silicon Valley and moved its HQ to Luxembourg in 2022, has published the Runa Open Source Startup (ROSS) Index for the past four years, shining a light on the fastest-growing commercial open source software (COSS) startups. The company publishes quarterly updates, but last year it produced its first annual report, taking a top-down view of the whole of 2022 — something it’s repeating now for 2023.

Trends

Data is closely aligned with AI because AI relies on data for learning and making predictions, and this requires infrastructure to manage the collection, storage and processing of that data. These two adjacent trends collided in this report.

Hitting the top spot in the ROSS Index for last year was LangChain, a two-year-old San Francisco–based startup that has developed an open source framework for building apps based on large language models (LLMs). The company’s main project passed 72,500 stars in 2023, with Sequoia going on to lead a $25 million Series A round into LangChain just last month.

Top 10 COSS startups in the ROSS Index for 2023. Image Credits: Runa Capital

Elsewhere in the top 10 is Reflex, an open source framework for creating web apps in pure Python, with the company behind the product recently securing a $5 million seed investment; AITable, a spreadsheet-based AI chatbot builder and something akin to an open source Airtable competitor; Sismo, a privacy-focused platform that allows users to selectively disclose personal data to applications; HPC-AI, which is building a distributed AI development and deployment platform in a push to become something like the OpenAI of Southeast Asia; and open source vector database Qdrant, which recently secured $28 million to capitalize on the burgeoning AI revolution.

A broader look at the “top 50 trending” open source startups last year reveals that more than half (26) are related to AI and data infrastructure.

Top 50 COSS startups in the ROSS Index for 2023. Image Credits: Runa Capital

It’s difficult to properly compare the 2023 index with the previous year from a vertical perspective, due largely to the fact that businesses often pivot or change their product positioning to suit what’s hot today. With the ChatGPT hype train going full throttle last year, this may have led earlier-stage startups to alter their focus, or even just place greater emphasis on the existing “AI” element of their product.

But with 2023 widely seen as generative AI’s breakthrough year, it’s easy to see why demand for open source componentry might skyrocket as companies of all sizes look to keep pace with proprietary AI juggernauts such as OpenAI, Microsoft, and Google.

Geographies

Open source software has always been highly distributed, with developers from all over the world contributing. That ethos often translates into commercial open source startups without a traditional center of gravity anchored by a brick-and-mortar HQ.

However, the ROSS Index goes some way toward bringing geography into the picture, reporting that 26 companies on the list have an HQ in the U.S., though 10 of these companies originated elsewhere and still have founders or employees based in other locales.

In total, the top 50 hailed from 17 countries, with 23 of the companies incorporated in Europe — a 20% rise on the previous year’s index. France had the most COSS startups with seven, including top-10 entrants Sismo and Massa, while the U.K. jumped from just one startup in 2022 to six in 2023, putting it second in Europe.

Other notable tidbits to emerge from the report include programming languages — the ROSS Index recorded 12 languages used by the top 50 last year, versus 10 in 2022. But TypeScript, a JavaScript superset developed by Microsoft, remained the most popular, used by 38% of the top 50 startups. Both Python and Rust grew in popularity, with Go and JavaScript dropping.

ROSS Index: Trending programming languages. Image Credits: Runa Capital

The top 50 ROSS Index participants collectively gained 12,000 contributors in 2023, while the overall GitHub star-count increased by nearly 500,000. The index also reveals that funding into the top 50 COSS startups last year hit $513 million, an increase of 32% on 2022 and 145% on 2021.

ROSS Index: Contributors, stars, and funding. Image Credits: Runa Capital

Methodology and context

It’s worth looking at the methodology behind all this — what factors influence whether a company is considered “top trending”? For starters, all companies included must have at least 1,000 GitHub stars (a GitHub metric similar to a “like” in social media) to be considered. But star-count alone doesn’t tell us much about what’s trending, given that stars are accumulated over time — so a project that has been on GitHub for 10 years is likely to have accumulated more stars than one that has existed for 10 months. Instead, Runa measures the relative growth of the stars over a given period using an annualized growth rate (AGR) — this looks at the star value now versus a previous corresponding period to see what has grown most impressively.
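Runa doesn’t publish the exact formula behind its AGR metric, but annualizing a star-growth ratio over an arbitrary observation window typically looks something like the sketch below. The function name, the 365-day year, and the example numbers are all illustrative assumptions, not Runa’s actual implementation.

```python
def annualized_growth_rate(stars_start: int, stars_end: int, days: float) -> float:
    """Annualize the growth in GitHub stars over an observation window.

    Compounds the observed growth ratio up to a full year, so projects
    measured over windows of different lengths can be compared on an
    equal footing.
    """
    if stars_start <= 0 or days <= 0:
        raise ValueError("need a positive starting star count and window")
    return (stars_end / stars_start) ** (365 / days) - 1

# A project growing from 1,000 to 4,000 stars in half a year compounds
# to (4.0 ** 2) - 1 = 15.0, i.e. a 1,500% annualized growth rate.
print(annualized_growth_rate(1000, 4000, 182.5))  # → 15.0
```

Measuring relative growth this way is what lets a 10-month-old project outrank a 10-year-old one, even though the older project’s absolute star count is far higher.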

A degree of manual curation is involved here, too, given that the goal is to single out open source “startups” specifically — so the Runa investment team picks out projects that belong to a “product-focused commercial organization” founded fewer than 10 years ago with less than $100 million in known funding.
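Taken together, the index’s published inclusion filters amount to a simple predicate. The class and field names below are my own illustration, and the manual curation step (“product-focused commercial organization”) is deliberately left unmodeled:

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Illustrative record of the attributes the ROSS Index filters on."""
    name: str
    github_stars: int
    company_age_years: float
    known_funding_usd: float


def eligible_for_index(p: Project) -> bool:
    """Apply the ROSS Index's stated inclusion thresholds:
    at least 1,000 GitHub stars, founded fewer than 10 years ago,
    and less than $100 million in known funding."""
    return (
        p.github_stars >= 1000
        and p.company_age_years < 10
        and p.known_funding_usd < 100_000_000
    )
```

Under these thresholds a two-year-old project with 72,500 stars and $25 million raised qualifies, while a project past either the age or funding ceiling is filtered out regardless of its star count.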

Defining what constitutes “open source” has its own inherent challenges, too, as there is a spectrum of how “open source” a startup is — some are more akin to “open core,” where most of their major features are locked behind a premium paywall, and some have licenses that are more restrictive than others. So for this, the curators at Runa decided that the startup must simply have a product that is “reasonably connected to its open-source [repositories],” which obviously involves a degree of subjectivity when deciding which ones make the cut.

There are further nuances at play too. The ROSS Index adopts a particularly liberal interpretation of “open source” — for example, both Elastic and MongoDB abandoned their open source roots for licenses that are “source available,” to protect themselves from being taken advantage of by the major cloud providers. According to the ROSS Index’s methodology, both these companies would qualify as “open source” — even though their licenses are not formally approved as such by the Open Source Initiative, and these specific example companies no longer refer to themselves as “open source.”

Runa’s methodology thus relies on what it calls the “commercial perception of open-source,” rather than the actual license a company attaches to its project. This means that restrictive source-available licenses such as the BSL (Business Source License) and the SSPL (Server Side Public License), the latter introduced by MongoDB as part of its transition away from open source in 2018, are very much on the menu as far as commercial companies in the ROSS Index are concerned.

“Such licenses maintain the OSS spirit — all its freedoms, except for slightly limited redistribution, which does not affect developers but grants original vendors a long-term competitive edge,” Konstantin Vinogradov, Runa Capital’s London-based general partner, explained to TechCrunch. “From a VC perspective, it is just an evolved playbook for exactly the same type of companies. The open source definition applies to software products, not companies.”

There are other notable filters in place too. For instance, companies that are mostly focused on providing professional services, or side projects with limited active support or with no commercial element, are not included in the ROSS Index.

For comparative purposes, there are other indexes and lists that give a steer on what’s hot in the open source landscape. Another VC firm, Two Sigma Ventures, maintains the Open Source Index, which is similar in concept to Runa’s except that it spans all manner of open source projects (not just startups) and offers additional filters, including the ability to sort by GitHub’s “watchers” metric, which some argue gives a more accurate picture of a project’s true popularity.

GitHub itself also publishes a trending repositories page, which, similar to Two Sigma Ventures, doesn’t focus on the business behind the project.

So the ROSS Index has emerged as a useful complementary tool for figuring out which open source “startups” specifically are worth keeping tabs on.