Start Up No.2325: OpenAI study shows chatbot inaccuracy, the shoplifter class, a chatbot ran my life, eat less sugar!, and more


The advent of cheap, far-reaching and powerful drones has made the Red Sea far less safe for even big navies. CC-licensed photo by U.S. Naval Forces Central Command/U.S. Fifth Fleet on Flickr.

You can sign up to receive each day’s Start Up post by email. You’ll need to click a confirmation link, so no spam.

A selection of 9 links for you. Still afloat. I’m @charlesarthur on Twitter. On Threads: charles_arthur. On Mastodon: https://newsie.social/@charlesarthur. Observations and links welcome.


OpenAI research finds that even its best models give wrong answers a wild proportion of the time • Futurism

Victor Tangermann:

»

OpenAI has released a new benchmark, dubbed “SimpleQA,” that’s designed to measure the accuracy of the output of its own and competing artificial intelligence models.

In doing so, the AI company has revealed just how bad its latest models are at providing correct answers. In its own tests, its cutting edge o1-preview model, which was released last month, scored an abysmal 42.7% success rate on the new benchmark.

In other words, even the cream of the crop of recently announced large language models (LLMs) is far more likely to provide an outright incorrect answer than a right one — a concerning indictment, especially as the tech is starting to pervade many aspects of our everyday lives.

Competing models, like Anthropic’s, scored even lower on OpenAI’s SimpleQA benchmark, with its recently released Claude 3.5 Sonnet model getting only 28.9% of questions right. However, the model was far more inclined to reveal its own uncertainty and decline to answer — which, given the damning results, is probably for the best.

Worse yet, OpenAI found that its own AI models tend to vastly overestimate their own abilities, a characteristic that can lead to them being highly confident in the falsehoods they concoct.

LLMs have long suffered from “hallucinations,” an elegant term AI companies have come up with to denote their models’ well-documented tendency to produce answers that are complete BS.

Despite the very high chance of ending up with complete fabrications, the world has embraced the tech with open arms, from students generating homework assignments to developers employed by tech giants generating huge swathes of code.

And the cracks are starting to show. Case in point: an AI model used by hospitals and built on OpenAI tech was caught this week introducing frequent hallucinations and inaccuracies while transcribing patient interactions.

«

Nothing to see here, move along, it’s all fine.

unique link to this extract


Generative AI made all my decisions for a week. Here’s what happened • The New York Times

Kashmir Hill:

»

Generative AI took over my life.

For one week, it told me what to eat, what to wear and what to do with my kids. It chose my haircut and what colour to paint my office. It told my husband that it was OK to go golfing, in a lovey-dovey text that he immediately knew I had not written.

Generative AI, which can spin up research reports, draft emails and converse like a human being based on patterns learned from enormous data sets, is being widely adopted by industries from medicine to business consulting as a timesaving tool. It’s popping up in widely used consumer apps, including Siri and Alexa. I conducted this admittedly ridiculous experiment to see how its spread might affect the largest work force of them all: harried parents.

One expert called the stunt a “decision holiday.” Another compared it to hiring a butler to handle household logistics. But these AI butlers cost just $20 per month and report everything you say to them back to a tech company.

In all, I used two dozen generative AI tools for daily tasks and nearly 100 decisions over the course of the week. Chief among my helpers were the chatbots that every big tech company released in the wake of ChatGPT. My automated advisers saved me time and alleviated the burden of constantly making choices, but they seemed to have an agenda: turn me into a Basic B.

We’ve been worried about a future where robots take our jobs or decide that humans are earth’s biggest problem and eliminate us, but an underrated risk may be that they flatten us out, erasing individuality in favour of a bland statistical average, like the paint colour AI initially recommended for my office: taupe.

I told the chatbots that I was a journalist conducting an experiment and that I had a family, but not much more. AI’s first task was to plan our meals and generate a shopping list for the week, and within seconds, it was done. (Unfortunately, generative AI couldn’t go pick up the groceries.)

«

What sort of useless robot future is this, where the robot doesn’t do the hard work? Also, taupe? These things have no taste. (Great idea for an article, though.)
unique link to this extract


Rise of the middle-class shoplifters: Americans are stealing from stores • Business Insider

Emily Stewart:

»

A lot of people steal, from small-stakes stuff at the drugstore to larger items worth hundreds of dollars at hardware chains. Their motivations are generally not the direct result of economic need, but instead, people make a moral (or amoral) judgment about what goods are unjustly expensive, especially as they deal with the recent bout of inflation. They view it as a way to get back at The Man — many have concocted a code of conduct that amounted to pilfering only from big, evil retailers (and, in one case, overpriced corporate ski resorts).

Case in point: a lot of the one-off shoplifters I talked to steal from Whole Foods with a very clean conscience. “No, I don’t feel bad about stealing from Jeff Bezos,” one 20-something occasional shoplifter in Washington, DC, told me. Her loot of choice is passion fruit, which she rings up as a cheaper item — bananas. She’s even memorized the code: 4011.

Another shoplifter, a 30-something man in New York who asked to be referred to as the “Parmesan cheese bandit,” echoed the anti-Bezos sentiment. The only people who know about his habit of stuffing a block of Whole Foods cheese into his sweatpants pocket after hitting the gym (which he developed after seeing some TikTok videos about Parmesan’s high protein content) are his brother — and, he said, “maybe the fucking surveillance people, I don’t know.”

The National Association of Shoplifting Prevention estimates that about one in 11 people has shoplifted during their lifetime and that men and women are equally likely to be the culprit. Some surveys suggest that number could be higher, like one in five. Survey data, however, often doesn’t account for the difference between someone who shoplifted a candy bar one time as a kid and someone who does it with regularity.

«

“Stealing from Bezos isn’t stealing” is an unsurprising response.
unique link to this extract


Exposure to sugar rationing in the first thousand days of life protected against chronic disease • Science

Tadeja Gracner, Claire Boone and Paul Gertler:

»

We examined the impact of sugar exposure within 1000 days since conception on diabetes and hypertension, leveraging quasi-experimental variation from the end of the United Kingdom’s sugar rationing in September 1953.

Rationing restricted sugar intake to levels within current dietary guidelines, yet consumption nearly doubled immediately post-rationing. Using an event study design with UK Biobank data comparing adults conceived just before or after rationing ended, we found that early-life rationing reduced diabetes and hypertension risk by about 35% and 20%, respectively, and delayed disease onset by 4 and 2 years.

Protection was evident with in-utero exposure and increased with postnatal sugar restriction, especially after six months when solid foods likely began. In-utero sugar rationing alone accounted for about one third of the risk reduction.

«

War is good for you, part 12 (at least for those looking back fondly on rationing). People who lived through rationing also tended to live longer. This study provides an argument for less added sugar in food – there’s certainly too much in it. One can guess though that this is probably going to be used, at least in some areas, to insist that pregnant women mustn’t eat various sweet things.
unique link to this extract


The Red Sea is now so dangerous even NATO warships are avoiding it • GCaptain

John Konrad:

»

The Red Sea, one of the world’s busiest and most strategically vital waterways, has become so hazardous that even the German Navy is steering clear. Defense Minister Boris Pistorius’s decision to redirect the frigate Baden-Württemberg and support vessel Frankfurt am Main around the Cape of Good Hope on their return from an Indo-Pacific deployment speaks volumes.

The Red Sea is now deemed too perilous, underscoring just how ineffective current US and EU naval protections are in this region.

For months, the U.S. and EU have stationed forces to secure the Red Sea’s shipping lanes. Yet, Houthi rebels, equipped and backed by Iran, continue to harass and attack vessels under the guise of “solidarity” with Palestinian forces in Gaza. Reports reveal Houthi attacks extending into the Indian Ocean and even the Mediterranean, a spread that demonstrates their increased capability and adaptability.

The EU’s mission Aspides commander warned of escalating danger but lacked the ships and resources needed to respond adequately. The United States Navy continues to send warships through the Red Sea, but its mission to protect merchant ships—Operation Prosperity Guardian—is considered a failure by several naval experts we interviewed and has significantly diminished in scope and size. As a result, even many US-flagged commercial vessels – which the US Navy is obligated by law to protect – are opting to divert their routes around Africa.

«

The advent of drone warfare – low-cost, high-impact, hard to deflect – has changed the entire face of naval and land-based warfare.
unique link to this extract


A Russian disinfo campaign is using comment sections to seed pro-Trump conspiracy theories • WIRED

David Gilbert:

»

“Video has come out from Bucks County, Pennsylvania showing a ballot counter destroying ballots for Donald Trump and keeping Kamala Harris’s ballots for counting,” an account called “Dan from Ohio” wrote in the comment section of the far-right website Gateway Pundit. “Why hasn’t this man been arrested?”

But Dan is not from Ohio, and the video he mentioned is fake. He is in fact one of hundreds of inauthentic accounts posting in the unmoderated spaces of right-wing news site comment sections as part of a Russian disinformation campaign. These accounts were discovered by researchers at media watchdog NewsGuard, who shared their findings with WIRED.

“NewsGuard identified 194 users that all target the same articles, push the same pro-Russian talking points and disinformation narratives, while masquerading as disgruntled Western citizens,” the report states. The researchers found these fake accounts posting comments in four pro-Trump US publications: the Gateway Pundit, the New York Post, Breitbart, and Fox News. They were also posting similar comments in the Daily Mail, a UK tabloid, and French website Le Figaro.

“FOX News Digital’s comment sections are monitored continuously in real time by the outside company OpenWeb which services multiple media organizations,” a spokesperson for the company tells WIRED. “Comments made by fake personas and professional trolls are removed as soon as issues are brought to our attention by both OpenWeb and the additional internal oversight mechanisms we have in place.”

«

It is exhausting, in its way. Will it all stop, or at least slow down substantially, after Tuesday? Here’s hoping.
unique link to this extract


Can the Daily Beast claw its way back to relevance? • The New York Times

Katie Robertson:

»

[New co-editor at the Daily Beast, Joanna] Coles has a specific vision for what The Beast can do: shorter and sharper articles that focus on people, power, politics and pop culture, with a dose of satire to lighten the mood during a perilous time. She says she is fascinated by the extremes of wealth and power and behavior, and pointed to articles about A-list celebrities who gave Sean Combs cover and the troubles of Will Lewis, the chief executive of The Washington Post, as recent highlights.

“I wanted something that curated a lot of news out there that wasn’t about the end of democracy all the time,” she said. Ms. Coles said she felt that The Beast had “some very good editors” over the years, but that its place in the media landscape had been diminished.

“I certainly wasn’t reading it on a regular basis,” she said. “Another editor said to me when we came here, he said: ‘It’s the boring avatar of the resistance.’ I thought he summed it up in one.”

She is trying to spark ideas and reinvigorate the newsroom’s culture, which she said had been affected when the pandemic caused more people to work remotely. Beast employees are now required to work in the office four days a week. “We have some really good people here who were here all along who are excited by the mission and re-energized,” Ms. Coles said.

There are others she was not as fond of. Ms. Coles recounted that a political reporter did not call in to the newsroom on the Sunday that President Biden announced he would drop out of the presidential race, or the previous weekend, after the assassination attempt on Mr. Trump in Butler, Pa.

Ms. Coles was at her house in the Hamptons on the day of the assassination attempt, hosting her friend Emma Tucker, the editor in chief of The Wall Street Journal and a fellow Briton, and the pair “turned my dining room table into an ad hoc newsroom.”

“I was incredulous that a political reporter would not call in because it was the weekend,” she said. “To me it was madness that we would have political correspondents who didn’t want to cover that story immediately.” That political reporter, Jake Lahut, who has since left The Daily Beast, referred The New York Times to the three articles he filed the day that Mr. Biden dropped out, and said that he was out of the country on leave the previous weekend but had contacted the Beast’s weekend desk. “The most important lesson I’ve learned from this is how not to run a newsroom,” Mr. Lahut said. He added: “I think my track record speaks for itself.”

«

I think Coles is fighting an uphill battle with the Daily Beast staff there. It isn’t going to end well.
unique link to this extract


Apple is ‘seriously considering’ Vision device that offloads compute to your iPhone • 9to5Mac

Michael Burkhardt:

»

According to Mark Gurman from Bloomberg, Apple is ‘seriously considering’ developing a cheaper Apple Vision device that offloads all of the computing power to your iPhone, essentially developing a headset with primarily displays and battery.

Gurman has mentioned this idea in the past, but it was just one of many ideas Apple was toying with for future products in the Apple Vision family. However, it now sounds like the idea might have more merit.

This product would be similar to other products on the market like the Xreal glasses, which show content from your iPhone through the displays of the glasses.

Earlier this morning, Ming-Chi Kuo reported that Apple’s cheaper Vision headset has been delayed beyond 2027, so it’s highly possible that this “Vision as an iPhone accessory” product could come in place of it. Gurman says this product “would reinforce the iPhone as the center of [Apple’s] product ecosystem.”

«

Pretty hard to avoid the iPhone – after all, everyone has a smartphone. The Vision Pro already looks like a dead duck, because there simply isn’t enough content for it. Despite Apple’s wish for it to be a “spatial computing” device that people use to do work, that isn’t how people think of these devices; they think of them as consumption devices.
unique link to this extract


‘Infinite monkey theorem’ challenged by Australian mathematicians • BBC News

Hannah Ritchie:

»

Two Australian mathematicians have called into question an old adage: that, given an infinite amount of time, a monkey pressing keys on a typewriter would eventually write the complete works of William Shakespeare.

Known as the “infinite monkey theorem”, the thought-experiment has long been used to explain the principles of probability and randomness.

However, a new peer-reviewed study led by Sydney-based researchers Stephen Woodcock and Jay Falletta has found that the time it would take for a typing monkey to replicate Shakespeare’s plays, sonnets and poems would be longer than the lifespan of our universe.

Which means that while mathematically true, the theorem is “misleading”, they say.

As well as looking at the abilities of a single monkey, the study also did a series of calculations based on the current global population of chimpanzees, which is roughly 200,000.

The results indicated that even if every chimp in the world was enlisted and able to type at a pace of one key per second until the end of the universe, they wouldn’t even come close to typing out the Bard’s works.

There would be a 5% chance that a single chimp would successfully type the word “bananas” in its own lifetime. And the probability of one chimp constructing a random sentence – such as “I chimp, therefore I am” – comes in at one in 10 million billion billion, the research indicates.

“It is not plausible that, even with improved typing speeds or an increase in chimpanzee populations, monkey labour will ever be a viable tool for developing non-trivial written works,” the study says.

«
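The 5% “bananas” figure quoted above can be roughly sanity-checked with a short back-of-the-envelope calculation. The keyboard size and chimp lifespan below are my assumptions for illustration, not numbers stated in the extract:

```python
import math

KEYS = 30              # assumed: letters plus a few extra typewriter keys
RATE = 1.0             # keystrokes per second, as in the study
LIFESPAN_YEARS = 30    # assumed chimp lifespan

word = "bananas"
p_hit = (1 / KEYS) ** len(word)        # chance a given 7-key run spells it
n_strokes = LIFESPAN_YEARS * 365.25 * 24 * 3600 * RATE

# P(at least one success) over ~n_strokes possible starting positions,
# using the Poisson approximation 1 - exp(-n * p)
p_lifetime = 1 - math.exp(-n_strokes * p_hit)
print(f"{p_lifetime:.1%}")  # about 4%, in line with the quoted ~5% figure
```

With these assumptions the answer lands in the same ballpark as the study’s; a slightly smaller keyboard or longer lifespan pushes it to 5%.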

“While mathematically true, the theorem is misleading”? This is just abject nonsense. Nobody is suggesting monkey labour for anything. This is purely a thought experiment about the nature of infinity, and our inability to grasp it.

Bob Newhart would be proud. “If they ever tried this, they would have to hire guys to check whether the monkeys were turning out anything worthwhile…”
unique link to this extract


• Why do social networks drive us a little mad?
• Why does angry content seem to dominate what we see?
• How much of a role do algorithms play in affecting what we see and do online?
• What can we do about it?
• Did Facebook have any inkling of what was coming in Myanmar in 2016?

Read Social Warming, my latest book, and find answers – and more.


Errata, corrigenda and ai no corrida: none notified
