"AI" to Summarize a Document

11 comments, last by Timkin 19 years, 9 months ago
Here's an AI problem that sounds very difficult to me... How would you write a computer program that could read a document (say, a chapter out of a book) and give you a one-paragraph summary of its contents? I think this would be an incredibly useful program to build into all sorts of agent-based information gathering tools, but even after thinking about it for a while, I don't see where you would even start. The real problem is that to summarize a document, you have to understand it at some level. This seems a step above and beyond writing a chatterbot, which is the next closest problem in my mind. Has anyone done work on this?

Shedletsky's Bits: A Blog | ROBLOX | Twitter
Time held me green and dying
Though I sang in my chains like the sea...

I haven't done work on it but I know where I'd start.

I would first write a program to analyze English grammar and then do my own analysis of how people best pick out facts from a chapter of text and reword them. That would be a starting point.
The grammar analysis engine would depend on the quality of the writing. Even I can barely understand the point of some of the posts on this forum with exceptionally poor grammar... and spelling... and content...

As an amusing side note, I should mention that I enjoy tormenting the little kid next door by stopping her in the middle of a long, drawn-out story and telling her to summarize the rest in one sentence with no "ands". You gotta try it... It's hilarious.
[s]I am a signature virus. Please add me to your signature so that I may multiply.[/s]I am a signature anti-virus. Please use me to remove your signature virus.
I think a lot of this would have to do with keywords. If you look at the technology behind Google's "AdWords" advertising application, they are doing something similar. Their bot will scan a page and determine what it believes the page is about - and then serve ads that match that topic. This only manages to categorize things, though.
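To make the keyword idea concrete, here is a minimal sketch of guessing a page's topic from its most frequent words. It is only an illustration of plain word counting, not how Google's system works; the `top_keywords` function, the tiny stopword list, and the sample text are all made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use far larger ones).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "that",
    "it", "on", "for", "as", "with", "by", "this", "be", "has", "have",
}

def top_keywords(text, n=3):
    """Return the n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]

# Hypothetical sample page.
page = (
    "Global warming is a hot topic. Scientists say warming is accelerating, "
    "and warming trends are visible in global temperature records."
)
print(top_keywords(page, 2))  # ['warming', 'global']
```

As the sketch shows, counting words tells you what a page is about, but nothing about what the page says about it.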

Writing a summary is a whole different animal, however, in that you have to then formulate some sort of stance on what the article is about. For example, you could determine with word analysis that an article is about global warming. However, how do you even determine that the article is claiming it does or does not exist? Even if you could determine that, how do you then write a summary that expresses WHY the article believes that it does or does not exist?

Dave Mark - President and Lead Designer of Intrinsic Algorithm LLC
Professional consultant on game AI, mathematical modeling, simulation modeling
Co-founder and 10 year advisor of the GDC AI Summit
Author of the book, Behavioral Mathematics for Game AI
Blogs I write:
IA News - What's happening at IA | IA on AI - AI news and notes | Post-Play'em - Observations on AI of games I play

"Reducing the world to mathematical equations!"

It's hard enough for humans to write such a summary.

(Hey, I haven't met anyone who actually liked doing book reports for English class in high school.)

I don't even want to start thinking what will be needed for an AI to do it. [razz]
God does not play dice!
TangentZ: You're right. Writing book reports is long, tedious, and no one likes to do it. That's why writing the first information assimilation agents, if it's even possible, would make you richer than Brin and Page.

I could see such a product putting a lot of secretaries out of business. Are you sure you want to do that, and help proliferate the McDonaldization of the workplace?

Anyhow, one place you might want to look as a starting point is ETS (Educational Testing Service). On the SAT II writing tests, I believe they use a computer algorithm to assign scores. I don't know the exact name or method; I just heard about it in an old technology magazine. It has something to do with keywords and stochastic comparison to a set of known 'good' essays on the same subject.
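The "comparison to known good essays" idea can be sketched as follows. This is not ETS's actual method, which isn't known here; it is just one simple way to compare keyword usage, using bag-of-words cosine similarity. All function names and sample essays are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_against_references(essay, references):
    """Average similarity between an essay and a set of reference essays."""
    if not references:
        return 0.0
    essay_vec = bag_of_words(essay)
    sims = [cosine(essay_vec, bag_of_words(ref)) for ref in references]
    return sum(sims) / len(sims)

# Hypothetical 'good' essays on the same subject.
good_essays = [
    "Alligators are large reptiles that live in swamps and lakes.",
    "Alligators live in lakes and eat fish.",
]
print(score_against_references("Alligators live in lakes.", good_essays))
print(score_against_references("My car needs new tires.", good_essays))
```

An essay that shares vocabulary with the references scores higher than one that does not, which says something about topic but very little about quality.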
Maybe just "extracting" simple facts could do the job. I mean, finding out the words that act upon others and/or forming a short paragraph from long ones.

Quote:
Man Survives Attack by Punching Gator
Wednesday, July 28, 2004

•Florida Woman Dies After Alligator Attack
TAVARES, Fla. — An 11-foot alligator attacked a man pulling weeds along the shore of a lake, but he saved himself by punching the beast in the nose.

Guy R. Daelemans, 43, suffered leg wounds in Tuesday's attack on Lake Eustis, Lake County sheriff's Lt. Todd Luce said. He was treated and released from a hospital.

A trapper summoned by wildlife officials later caught the 385-pound alligator, which was then killed.

Last week, a 54-year-old landscaper died of an infection, two days after a 12-foot alligator dragged her into a pond and nearly tore off her right arm as she worked behind a home on Sanibel Island.

That alligator was also trapped and killed.


"Man Survives Attack by Punching Gator"
"Woman Dies"
"alligator attacked a man"
"he saved himself by punching the beast"
"Guy R. Daelemans, 43, suffered leg wounds"
"He was treated and released"
"A trapper caught the 385-pound alligator"
"54-year-old landscaper died"
"alligator dragged her"
"alligator was trapped and killed"
I like the Walrus best.
Microsoft Word used to have an AutoSummarize feature which would automatically summarize your text into a certain word or sentence limit. It didn't work very well since in summaries we mainly look for the main ideas of a text, but it was a decent start I'd say.
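For anyone curious what a sentence-limit summarizer of this general kind might look like, here is a minimal sketch. It is not Word's actual algorithm, which I don't know; it assumes a simple frequency heuristic where a sentence scores higher if it contains words that are common in the whole text. The `summarize` function, the stopword list, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "by", "he", "she", "it"}

def content_words(text):
    """Lowercase words in text, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

text = "An alligator attacked a man. The man punched the alligator. The weather was sunny."
print(summarize(text, 2))
# An alligator attacked a man. The man punched the alligator.
```

This only selects existing sentences; it never rewords anything, so it picks up frequent words rather than main ideas.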
h20, member of WFG 0 A.D.
There are at least three branches of AI that are working toward this goal...

The first involves (as Dave pointed out) analysing and classifying documents based on keyword searches. This is a key task in document retrieval and automated content delivery systems (i.e., systems that scan large databases, like the web, and deliver to you new content related to topics they believe you would be interested in, typically based on preferences, past behaviours and feedback from previous deliveries).

The second area of relevance to this problem is argument generation: the ability to decipher from a set of statements a logical argument as to what the facts say and then making logical deductions from them (or inductions, if you whack an inductive engine into the system...which is harder, since it requires hypothesis testing within the conversation, which is not always possible with written text).

The third area of relevance is natural discourse planning, which has as a related problem, content understanding (which is related to the argument generation task). Discourse planning involves the task of deducing what to say next to convey an idea, with the constraint that what you say should generally flow from what was said before (unless you're changing tack in the conversation).

We're not there yet, although there are some very nice systems in each of these areas that lead me to believe that we'll be there in 10-15 years.
Cheers,

Timkin
