42 Comments
Emiliano's avatar

I was able to correctly identify the human-written version of both, but I did this via filtering for syntax rather than by content.

This is more generalizable, obviously. I tend to glaze over LLM writing and skim once I identify it. But the tells I use to identify it - triplicates, inherent hedging - none of this is particularly durable. I'm curious if these tells would have been present if you crafted the prompts with specific examples of your written text rather than what's already present in the weights.

Dylan Black's avatar

Exactly what I did as well. I think the results of the survey undersell the conclusions; this test was easy. If you stripped out the tropes and de-sanitized the writing, I could not have differentiated them.

Coagulopath's avatar

While this is true and almost tautologically so ("if you strip out the tropes of Stephen King's writing, it can no longer be recognized as Stephen King"), the issue is that current LLMs struggle to disguise their own tells. If it needs a human to do an editing pass, it's questionable whether it's LLM writing anymore. (It's at least not fully LLM writing.)

Dylan Black's avatar

I was trying to say something a bit less tautological. LLMs have very *local* tells, like specific words, phrases, and sentence structures. It’s true of course that if you strip out the uniqueness of Stephen King it isn’t Stephen King, but his style is much larger, more embedded in the entire structure. That is harder both to identify and to strip out. When I did this sweep I did things like “count the number of punchy, <5 word conclusion sentences” and “count the tricolons” and the AI ones had 10x as many. That, to me, *can* be stripped out without losing the “AI-ness” of the structure, but it WILL lose the easy ID. Does that make more sense?
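A rough sketch of what that kind of sweep could look like in Python. The function names, the regex patterns, and the five-word cutoff are all assumptions made for illustration, not the commenter's actual method. It also counts every short sentence rather than only paragraph-closing ones, and the tricolon pattern only catches single-word items such as "fast, cheap, and easy".

```python
import re

# Hypothetical helpers: patterns and thresholds are illustrative guesses.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")                 # naive sentence boundary
TRICOLON = re.compile(r"\b\w+, \w+,? (?:and|or) \w+\b")       # "a, b, and c" with one-word items


def split_sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


def count_tells(text, word_limit=5):
    """Count short sentences (fewer than word_limit words) and simple tricolons."""
    sentences = split_sentences(text)
    short = sum(1 for s in sentences if len(s.split()) < word_limit)
    tricolons = len(TRICOLON.findall(text))
    return {
        "sentences": len(sentences),
        "short_sentences": short,
        "tricolons": tricolons,
    }


sample = (
    "The voters had a different theory, which was that they didn't care. "
    "They still don't. It was fast, cheap, and easy."
)
print(count_tells(sample))
```

Comparing these counts per essay, normalized by length, would give the kind of ratio described above, though a heuristic this crude will miss multi-word tricolons and miscount abbreviations.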

Rickie Elizabeth's avatar

I have yet to come across any model that has been able to maintain an author’s voice across paragraphs, even when the model is local and trained on a vast amount of work by an author. I notice they tend to waver between excessively imitating quirks and defaulting towards a flattened average. They do better with imitating style in short sentences. With multiple paragraphs, they struggle to maintain a consistent symbolic representation of the rhetorical intent, even with long context windows.

Of course, examples do help, and a model specifically trained on your work is going to perform better at mimicking the style than one not given your work, but even then, it will still drift back into high-probability speech patterns of the learned distribution, which is what causes the “generic AI” sound. I usually can recognize AI written content immediately (within a few sentences) because of this.

LLMs can tell you what active voice is, and summarize a full book on writing style or grammar, but I have yet to succeed at getting one to “implement” it as a deliberate stylistic rule over longer-form content while simultaneously preserving coherence/semantic precision.

To get the AI to preserve style for a longer duration, you’d need to focus more on giving it explicit modeling of inter-sentential logical relationships and paragraph-level structure (and likely hierarchical rules). But the context window really still makes this difficult.

B.P. Majors's avatar

I've had "AI detection" programs identify my original writing as written by AI.

Should I ask my parents if I am really theirs?

Jomhke's avatar

Which ones? Most of the heavily-marketed ones are junk. Pangram seems pretty good in testing

zdk's avatar

Yeah, Pangram is not good. I’ve been able to flip between 100% AI and 100% human by flipping one key word.

Jomhke's avatar

How long was the sample text, and what sort of flip? There is a minimum word count.

I tried both directions: heavy prompting to make the AI text sound human (it still identified AI), and writing myself in an AI way (it still identified human).

zdk's avatar

A few paragraphs

Michael's avatar

My intuition is that your logic about how AI just helps already pre-formed ideas be expressed is iffy. Ezra Klein has been repeatedly making the point in his columns recently that an idea doesn't really exist independently from the medium (whether that be writing or verbal delivery in a speech) that was used to express it, which is why he has been growing slightly more AI skeptical. Spell check is productivity enhancement. This is more like genuine replacement. I think you might be conflating the latter with the former.

In fact I think the "doing worse than chance" thing is not a coincidence. Real Hanania will change over time because of his lived experience. AI Hanania never changes. If Real Hanania uses AI Hanania more and more, then Real Hanania will evolve less in his thinking, and part of why I read Real Hanania in the first place is the fact that his opinions change over time in response to his lived experience, which makes him interesting. If you start to use AI a lot, I think you will trap your own mind in amber, even if the prose is not bad and the ideas are coherent.

Search engines are in an interesting middle ground between glasses/spell-check and AI writing, though, because of how algorithms are used to decide what appears at the top. The gradual improvement of search engine optimization is like semi-AI in that way.

Rickie Elizabeth's avatar

On your last point, I work in search engine optimization (SEO), and now about 50% of the work I do pertains to Generative Engine Optimization (GEO) (although it has several names) and Entity SEO.

AI bots crawl content in chunks and prefer semantic triples to understand a concept and map it. When it comes to human-created content, they can get the general gist, but they still often fail to read between the lines or interpret intent when it is not explicitly stated. This can be bad, as someone could ask an AI to interpret a tweet and it tells them something false about the meaning/intent, based on outdated parametric knowledge it has on the person tweeting (like, say, a handful of their stances from 2023), as it doesn’t search the internet for every prompt.

Something interesting is that in some cases you can literally inject prompts into the page content which some LLMs will sometimes follow when crawling the page. So exploiting this to mislead AI agents is an actual worry, especially if you create two webpages, one with information for human eyes and one for AI agents.

So it may be that we see something with AI similar to how Google has regularly updated its algorithm to try to outmaneuver black hat SEO tactics.

Scott Sumner's avatar

It is important not to read too much into the sort of Turing tests on painting done by Scott Alexander. It has long been known that humans could imitate the work of great artists in ways that were difficult for experts to distinguish. I have no artistic talent at all, and yet I might be able to produce a Mondrian painting that would fool the experts. But that fact would not make me anywhere near as talented an artist as Mondrian, just as the fact that I might be able to build a light bulb in my basement would not make me as talented an inventor as Edison.

It seems very possible that AIs might eventually be able to produce great art. But we are not there yet. If and when they do produce great art, I believe they will have consciousness and be deserving of "human rights".

Arif's avatar
May 17 (edited)

“I suspect my female fans are an unusually smart (and stunning, brave etc.) group of women” got a good chuckle out of me.

Andy Iverson's avatar

In contrast to some of the other comments here by people who are feeling superior, allow me to admit that I got both wrong, and I was especially wrong on the first one, which most people got right. This makes me wonder if I am actually kind of bad at reading.

fox's avatar
May 17 (edited)

I got both right on the survey and expressed confusion about what the prompt was for the abundance essay because the arguments felt weird and disjointed. Now, seeing the prompts, it makes much more sense, because you told the AIs to add in additional filler theories. The AI responses definitely lost some coherence beyond the bullet points in the prompt, and that threw me a little when I was trying to identify the human version because I assumed they were all supposed to cover the same points.

Also, I'm not sure how much I would read into this. If your typical output was the quality of any of the abundance essays, I don't think you would have become a popular writer in the first place.

Chastity's avatar

> It’ll be interesting in the coming years to see what happens in the arms race between AI writing ability and AI plagiarism detection software. My guess is that AI writing ability has to win. What exactly rules out the possibility that it can write just as well as I can? There should come a point where it is indistinguishable.

I did listen to the Odd Lots episode where they talk to a Pangram person, and it's not about the quality of the writing.

Rather, you could imagine you want to say something like, "I think Russia will invade Ukraine, with a high degree of confidence, because countries don't mass troops on other countries' borders for fun." There are lots of words you could use to express this thought, lots of ways to order the sentence. You could break it into two sentences, or maybe three. You could use words like "generally"; you could get poetic. Even that phrasing is particular to me: "for fun" could be replaced by "without the intent to use it". And so on, and so forth. Almost every word is picked out of a possibility space of thousands. Each AI tends to pick a particular path through putting together its sentences, and it must do that for every single sentence. Each sentence might individually be confused for human writing, but over time, the odds that a human "just so happens" to be writing like Gemini or Claude or ChatGPT get to be too low. Pangram was also trained in part by creating synthetic data - find a human five-star review of Denny's that's 78 words long, then ask the AI to write a five-star review of Denny's that's 78 words long in the style of the first one, then use those to train Pangram to recognize the A/B differences.
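To put a number on the compounding argument, here is a toy calculation in Python. It is only a back-of-the-envelope sketch and not a description of how Pangram actually scores text: the function name is hypothetical, and it assumes each sentence independently has the same fixed chance of matching a model's preferred phrasing by coincidence.

```python
def chance_match_probability(per_sentence_prob, n_sentences):
    """Probability that every one of n sentences matches by coincidence.

    Assumes sentences are independent and share one per-sentence probability,
    which is a simplification made purely for illustration.
    """
    return per_sentence_prob ** n_sentences


# Even a generous 50% per-sentence chance collapses quickly with length.
for n in (1, 5, 20, 40):
    print(n, chance_match_probability(0.5, n))
```

Under these assumptions, a single sentence is a coin flip, but twenty matching sentences in a row is already roughly a one-in-a-million coincidence, which is why longer samples are easier to classify than short ones.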

This also suggests it will be hard to get an AI model that can get around it. The fact that the LLM *isn't* just pulling words at random out of a hat is why its quality is high relative to a Markov chain. Based on how Pangram works, it seems like only a private model (e.g. a personal fine-tuned model with a large corpus) would be able to get around it. Or, I suppose, the big model creators could decide to introduce features entirely intended to get around Pangram and the like (for example, randomly using lower-probability tokens), but literally the only reason to do that is to help people cheat on tests.

Evan's avatar

I'm shocked scores were so low, I thought the test was very easy.

Eric R. Ward's avatar

It would be cool to try the next level up: tee up a situation without providing such a specific prompt, and ask Claude to perform a “Hanania-like analysis” of it and write a short resulting essay.

Coagulopath's avatar

This particular Claude tic is really annoying.

"The voters have decided they don't want the candidates the donors and consultants want. They never really did."

"The voters had a different theory, which was that they didn't care. They still don't."

"Schumer is reduced to issuing supportive statements ten minutes after Mills withdraws, pretending he was on board the whole time. He wasn't."

"The striking thing was not that the candidates sounded like libertarians. They didn’t."

My God, shut up.

Edmund Nelson's avatar

The second one was much harder. IIRC I got it right, but I wasn't super confident.

The use of triads and the negative parallelism are the 2 main giveaways.

The other thing I noticed is that you get model collapse around paragraph 4/5 a lot. It starts to read less like Hanania and more like generic AI-generated prose. It's like the tokens get gradually more AI and less like the prompt as you get more and more toward the end.

barnabus's avatar

"People prefer AI art" is similar to the argument "eat sh*t, trillions of flies can't be making a mistake".

The point is simply - would people who earn more than $1 million a year prefer to buy AI art rather than entirely one-human-crafted art?

Lahey's avatar

I got both of them in seconds. The Platner articles I knew in the first sentence.

Throwing in the descriptors "bleak" or "depressing" just gave the impression of trying too hard to write an opinion piece and to convince an audience.

Imperu's avatar

"He would say great art does X, but AI does Y, which is why AI can’t produce great art."

I have an argument in this vein (it's not that original to me anyway): art brings us closer to the mind of the artist, we therefore get to know them better. AI by definition can't do that (of course, as long as it's mimicking human forms of art; maybe if it will express its own experience some day and it could be comprehensible to us, that would be valuable AI art). Note that I don't say that AI art can't be "good" or "quality" or can't win in a blind test. I just say that AI art is the equivalent of fake news - once we know it's not real, it doesn't captivate us anymore.

Kean duHelme's avatar

I think there is a parallel in (pop) music. Concerts used to be a way to burnish the "brand" and foster the sale of records, where the money was. That model was destroyed by digitalization and piracy, but (pop) music didn't die, obviously. Now, recorded songs encourage concert attendance, where the money is.

Similarly, writing will likely drive interest towards "live" events, such as author panels, and the marketable skills will be related to quick-thinking, stage presence, clear delivery etc. Aren't we already there?

Reader's avatar

I guessed both right but had no confidence I was right haha