Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill
Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” Scarlett O’Hara described in Gone with the Wind. Berg, Attributes Tutorial CVPR 13
Recognition… person car shoe Berg, Attributes Tutorial CVPR 13
Toward Complex Structured Outputs car Berg, Attributes Tutorial CVPR 13
Toward Complex Structured Outputs pink car Attributes of objects Berg, Attributes Tutorial CVPR 13
Toward Complex Structured Outputs car on road Relationships between objects Berg, Attributes Tutorial CVPR 13
Toward Complex Structured Outputs Little pink smart car parked on the side of a road in a London shopping district. … Complex structured recognition outputs Telling the “story of an image” Berg, Attributes Tutorial CVPR 13
Learning from Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” Scarlett O’Hara described in Gone with the Wind. How does the world work? Visually descriptive language provides: • Information about the world, especially the visual world. • information about how people construct natural language for imagery. • guidance for visual recognition. What should we recognize? How do people describe the world? Berg, Attributes Tutorial CVPR 13
Methodology A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge. Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture. Pink Car Sign Door Motorcycle Tree Brick building Dirty Road Sidewalk London Shopping district Natural language description Generation Methods: 1) Compose descriptions directly from recognized content 2) Retrieve relevant existing text given recognized content Berg, Attributes Tutorial CVPR 13
Related Work • Compose descriptions given recognized content: Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011) • Generation as retrieval: Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012) • Generation using pre-associated relevant text: Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a) • Other (image annotation, video description, etc.): Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013) Berg, Attributes Tutorial CVPR 13
Method 1: Recognize & Generate Berg, Attributes Tutorial CVPR 13
Baby Talk: Understanding and Generating Simple Image Descriptions Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg CVPR 2011
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. ” Kulkarni et al, CVPR 11
Methodology • Vision -- detection and classification • Text inputs - statistics from parsing lots of descriptive text • Graphical model (CRF) to predict best image labeling given vision and text inputs • Generation algorithms to generate natural language Kulkarni et al, CVPR 11
Vision is hard! Green sheep World knowledge (from descriptive text) can be used to smooth noisy vision predictions! Kulkarni et al, CVPR 11
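The "green sheep" case can be made concrete with a small sketch. It mixes noisy attribute scores from vision with a prior estimated from descriptive text, so that an implausible attribute is pushed down. All numbers, the names `vision_scores` and `text_prior`, and the log-linear weighting are assumptions for illustration, not values or formulas from the paper.

```python
import math

# Hypothetical attribute classifier scores for a detected sheep (noisy vision).
vision_scores = {"green": 0.45, "white": 0.40, "black": 0.15}

# Hypothetical text prior: how often each attribute modifies "sheep" in parsed captions.
text_prior = {"green": 0.01, "white": 0.70, "black": 0.29}


def smooth(vision, prior, weight=0.5):
    """Log-linear mix of vision scores and a text prior, renormalized to sum to 1."""
    combined = {
        attr: math.exp((1 - weight) * math.log(vision[attr]) + weight * math.log(prior[attr]))
        for attr in vision
    }
    total = sum(combined.values())
    return {attr: score / total for attr, score in combined.items()}


smoothed = smooth(vision_scores, text_prior)
best_before = max(vision_scores, key=vision_scores.get)
best_after = max(smoothed, key=smoothed.get)
print(best_before, "->", best_after)
```

With these made-up numbers the top attribute moves from "green" to "white" once the text prior is taken into account.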
Methodology • Vision -- detection and classification • Text -- statistics from parsing lots of descriptive text • Graphical model (CRF) to predict best image labeling given vision and text inputs • Generation algorithms to generate natural language Kulkarni et al, CVPR 11
Learning from Descriptive Text Attributes green grass by the lake a very shiny car in the car museum in my hometown of upstate NY. Relationships very little person in a big rocking chair Our cat Tusik sleeping on the sofa near a hot radiator. Kulkarni et al, CVPR 11
Methodology • Vision -- detection and classification • Text -- statistics from parsing lots of descriptive text • Model (CRF) to predict best image labeling given vision and text based potentials • Generation algorithms to compose natural language Kulkarni et al, CVPR 11
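A heavily simplified sketch of the labeling step may help here. It uses one object node and one attribute node, unary potentials from vision, and a single pairwise potential from text, and it finds the best joint labeling by exhaustive search. The real model has more node types (objects, attributes, prepositions) and its own inference procedure; the scores, the names, and the brute-force search below are assumptions for illustration only.

```python
import itertools
import math

# Hypothetical unary potentials from vision (detector / classifier scores).
object_scores = {"dog": 0.7, "chair": 0.3}
attribute_scores = {"furry": 0.4, "wooden": 0.6}

# Hypothetical pairwise potentials from text: how often each attribute
# modifies each object in parsed descriptions.
text_pairwise = {
    ("furry", "dog"): 0.80, ("wooden", "dog"): 0.02,
    ("furry", "chair"): 0.03, ("wooden", "chair"): 0.70,
}


def log_score(obj, attr, text_weight):
    """Log-potential of one joint labeling: vision unaries plus weighted text pairwise."""
    return (
        math.log(object_scores[obj])
        + math.log(attribute_scores[attr])
        + text_weight * math.log(text_pairwise[(attr, obj)])
    )


def map_labeling(text_weight):
    """Exhaustive MAP inference: score every (object, attribute) pair, keep the best."""
    candidates = itertools.product(object_scores, attribute_scores)
    return max(candidates, key=lambda pair: log_score(pair[0], pair[1], text_weight))


vision_only = map_labeling(text_weight=0.0)
with_text = map_labeling(text_weight=1.0)
print(vision_only, with_text)
```

With the text potential switched off the sketch picks a "wooden dog"; with it switched on the labeling becomes a "furry dog".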
System Flow: Input image -> Extract objects/stuff (a: dog, b: person, c: sofa) -> Predict attributes (per-object scores for brown, striped, furry, wooden, feathered, ...) -> Predict prepositions (pairwise scores for near, against, beside, ...) -> Predict labeling (vision potentials smoothed with text potentials) -> Generate natural language description. Predicted labeling: <<null, person_b>, against, <brown, sofa_c>>; <<null, dog_a>, near, <null, person_b>>; <<null, dog_a>, beside, <brown, sofa_c>>. Generated description: “This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.” Kulkarni et al, CVPR 11
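The last stage of the flow, turning predicted triples into text, can be sketched with a simple template. The triples are the ones listed on the System Flow slide; the helper names `noun_phrase` and `describe` and the one-sentence-per-triple template are assumptions. The generator on the slide also adds an opening sentence and merges clauses that share a subject, which this sketch does not attempt.

```python
def noun_phrase(attr, obj):
    """Render 'the <attr> <obj>' and skip the attribute when it is None."""
    name = obj.rsplit("_", 1)[0]  # drop the instance suffix, e.g. 'sofa_c' -> 'sofa'
    return f"the {attr} {name}" if attr else f"the {name}"


def describe(triples):
    """Turn <<attr, obj>, prep, <attr, obj>> triples into one sentence each."""
    sentences = []
    for (attr1, obj1), prep, (attr2, obj2) in triples:
        sentence = f"{noun_phrase(attr1, obj1)} is {prep} {noun_phrase(attr2, obj2)}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return " ".join(sentences)


triples = [
    ((None, "person_b"), "against", ("brown", "sofa_c")),
    ((None, "dog_a"), "near", (None, "person_b")),
    ((None, "dog_a"), "beside", ("brown", "sofa_c")),
]
print(describe(triples))
```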
Some good results This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road. Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky. This is a picture of two dogs. The first dog is near the second furry dog. Kulkarni et al, CVPR 11
Some bad results Missed detections: Here we see one potted plant. This is a picture of one dog. False detections: There are one road and one cat. The furry road is in the furry cat. This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road. Incorrect attributes: This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass. This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green Kulkarni et al, CVPR 11
Algorithm vs Humans. Algorithm: “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Sounds unnatural! Humans: H1: A Lemonaide stand is manned by a blonde child with a cookie. H2: A small child at a lemonade and cookie stand on a city corner. H3: Young child behind lemonade stand eating a cookie. Kulkarni et al, CVPR 11
Method 2: Retrieval based generation Berg, Attributes Tutorial CVPR 13
Every picture tells a story, describing images with meaningful sentences Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth ECCV 2010 Slides provided by Ali Farhadi
A Simplified Problem Represent image/text content as subject-verb-scene triple Good triples: • (ship, sail, sea) • (boat, sail, river) • (ship, float, water) Bad triples: • (boat, smiling, sea) – bad relations • (train, moving, rail) – bad words • (dog, speaking, office) - both Farhadi et al, ECCV 10
The Expanded Model • Map from Image Space to Meaning Space • Map from Sentence Space to Meaning Space • Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV 10
Retrieval through meaning space • Map from Image Space to Meaning Space • Map from Sentence Space to Meaning Space • Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV 10
Image Space Meaning Space Predict Image Content using trained classifiers Farhadi et al, ECCV 10
Retrieval through meaning space • Map from Image Space to Meaning Space • Map from Sentence Space to Meaning Space • Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV 10
Sentence Space Meaning Space • Extract subject, verb and scene from sentences in the training data black cat over pink chair A black color cat sitting on chair in a room. cat sitting on a chair looking in a mirror. Subject: Cat Verb: Sitting Scene: room • Use taxonomy trees Object Animal Cat Dog Horse Human Vehicle Car Bike Train Farhadi et al, ECCV 10
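One way a taxonomy tree could be used is to score how close two object labels are, so that "cat" and "dog" count as nearer than "cat" and "car". The sketch below is an assumption throughout: the nesting is guessed from the flattened slide text (in particular, where "Human" sits), and the path-length similarity is a generic choice, not necessarily the measure used in the paper.

```python
# Assumed nesting of the taxonomy listed on the slide (child -> parent).
PARENT = {
    "animal": "object", "human": "object", "vehicle": "object",
    "cat": "animal", "dog": "animal", "horse": "animal",
    "car": "vehicle", "bike": "vehicle", "train": "vehicle",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def taxonomy_similarity(a, b):
    """1 / (1 + number of edges between a and b through their lowest common ancestor)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = next(node for node in path_a if node in path_b)
    distance = path_a.index(common) + path_b.index(common)
    return 1.0 / (1.0 + distance)


print(taxonomy_similarity("cat", "dog"), taxonomy_similarity("cat", "car"))
```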
Retrieval through meaning space • Map from Image Space to Meaning Space • Map from Sentence Space to Meaning Space • Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV 10
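The three steps can be sketched as a lookup in a shared triple space: images and sentences are both mapped to (subject, verb, scene) triples, and sentences are ranked by how well their triple matches the one predicted for the image. The slot-overlap score, the index contents, and the function names are assumptions for illustration. Only the first sentence comes from the slides, and the paper's actual matching is more elaborate.

```python
def triple_match(a, b):
    """Fraction of (subject, verb, scene) slots on which two triples agree."""
    return sum(x == y for x, y in zip(a, b)) / 3.0


def retrieve(image_triple, sentence_index, top_k=2):
    """Rank sentences by how well their triple matches the image's predicted triple."""
    ranked = sorted(
        sentence_index,
        key=lambda item: triple_match(image_triple, item[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:top_k]]


# Hypothetical sentence index: (sentence, triple extracted from it).
sentence_index = [
    ("A black color cat sitting on chair in a room.", ("cat", "sit", "room")),
    ("A ship sails on the sea.", ("ship", "sail", "sea")),
    ("A boat sails down the river.", ("boat", "sail", "river")),
]

# Hypothetical triple predicted for a query image by trained classifiers.
predicted = ("ship", "sail", "sea")
print(retrieve(predicted, sentence_index, top_k=2))
```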
Data: 1,000 images (Rashtchian et al 2010, Farhadi et al 2010): 5 descriptions per image, 20 object categories. 20,000 images (ImageCLEF challenge): 2 descriptions per image, select image categories. More data needed? Large amounts of paired data can help us study the image-language relationship. Berg, Attributes Tutorial CVPR 13
Through the smoke Duna Portrait #5 Mirror and gold the cat lounging in the sink Data exists, but buried in Berg, Attributes Tutorial CVPR 13
SBU Captioned Photo Dataset: 1 million captioned photos! http://tamaraberg.com/sbucaptions Example captions: “The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon”; “Man sits in a rusted car buried in the sand on Waitarere beach”; “Little girl and her dog in northern Thailand. They both seemed interested in what we were doing”; “Our dog Zoe in her bed”; “Interior design of modern white and brown living room furniture against white wall with a lamp hanging.”; “Emma in her hat looking super cute”. Berg, Attributes Tutorial CVPR 13
“Im2Text: Describing Images Using 1 Million Captioned Photographs” Vicente Ordonez, Girish Kulkarni, Tamara L. Berg NIPS 2011
Big Data Driven Generation An old bridge over dirty green water. One of the many stone bridges in town that carry the gravel carriage roads. A stone bridge over a peaceful river. Generate natural sounding descriptions using existing captions Ordonez et al, NIPS 11
Harness the Web! Global Matching (GIST + Color) against the SBU Captioned Photo Dataset (1 million captioned images!). Captions of matched images: “The water is clear enough to see fish swimming around in it.”; “A walk around the lake near our house with Abby.”; “Bridge to temple in Hoan Kiem lake.”; “Smallest house in paris between red (on right) and beige (on left).”; “Hangzhou bridge in West lake.”; “The daintree river by boat.”; … Transfer Caption(s), e.g. “The water is clear enough to see fish swimming around in it.” Ordonez et al, NIPS 11
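Global matching followed by caption transfer amounts to a nearest-neighbor lookup over image descriptors. The sketch below assumes each database image already has a global descriptor. The 3-dimensional toy vectors stand in for the GIST + color features named on the slide, and the Euclidean distance and function names are assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transfer_captions(query_features, database, top_k=1):
    """Return captions of the top_k database images closest to the query."""
    ranked = sorted(database, key=lambda entry: euclidean(query_features, entry[0]))
    return [caption for _, caption in ranked[:top_k]]


# Toy database of (global descriptor, caption) pairs; the vectors are made up.
database = [
    ([0.9, 0.1, 0.2], "The water is clear enough to see fish swimming around in it."),
    ([0.2, 0.8, 0.7], "The daintree river by boat."),
    ([0.8, 0.2, 0.3], "Bridge to temple in Hoan Kiem lake."),
]

query = [0.88, 0.12, 0.22]
print(transfer_captions(query, database, top_k=1))
```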
Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions). Reranked captions: “The bridge over the lake on Suzhou Street.”; “Iron bridge over the Duck river.”; “The Daintree river by boat.”; “Bridge over Cacapon river.” Transfer Caption(s), e.g. “The bridge over the lake on Suzhou Street.” Ordonez et al, NIPS 11
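Reranking with high-level content can be sketched as re-sorting the globally matched captions by how many detected content words they mention. The detector output, the plain word-overlap score, and the function names are assumptions for illustration. The paper combines several content estimates (objects, stuff, people, scenes) rather than a single word count.

```python
def content_overlap(detected, caption):
    """Count detected content words (objects, stuff, scenes) mentioned in the caption."""
    words = set(caption.lower().replace(".", " ").split())
    return len(detected & words)


def rerank(detected, candidates):
    """Reorder globally matched captions so those mentioning detected content come first."""
    return sorted(candidates, key=lambda c: content_overlap(detected, c), reverse=True)


# Hypothetical detector output for the query image.
detected = {"bridge", "lake", "water"}

# Candidate captions from the global matching step (taken from the slides).
candidates = [
    "The Daintree river by boat.",
    "Bridge over Cacapon river.",
    "The bridge over the lake on Suzhou Street.",
]
print(rerank(detected, candidates)[0])
```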
Results Bad Good Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind. Cat in sink. A female Mallard duck in the lake at Luukki Espoo. The cat in the window. Fresh fruit and vegetables at the market in Port Louis Mauritius. The boat ended up a kilometre from the water in the middle of the airstrip. Ordonez et al, NIPS 11
Next…. Composing novel captions from pieces of existing ones Berg, Attributes Tutorial CVPR 13
Composing captions guessing game a) monkey playing in the tree canopy, Monte Verde in the rain forest b) capuchin monkey in front of my window c) monkey spotted in Apenheul Netherlands under the tree d) a white-faced or capuchin in the tree in the garden e) the monkey sitting in a tree, posing for his picture Berg, Attributes Tutorial CVPR 13
“Collective Generation of Natural Image Descriptions” Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi ACL 2012
Composing Descriptions: Object appearance -> NP: the dirty sheep. Object pose -> VP: meandered along a desolate road. Scene appearance -> PP: in the highlands of Scotland. Region appearance & relationship -> PP: through frozen grass. Example Composed Description: the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass. Kuznetsova et al, ACL 12
SBU Captioned Photo Dataset: 1 million captioned photos! http://tamaraberg.com/sbucaptions Ordonez et al, NIPS 11
Data Processing 1,000 images: o Run object detectors o Run region based stuff detectors (grass, sky, etc.) o Run global scene classifiers o Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene). Kuznetsova et al, ACL 12
Image Description Generation: Image -> Computer Vision (Objects, Actions, Stuff, Scenes) -> Phrase Retrieval -> Generation -> Description. Kuznetsova et al, ACL 12
Retrieving VPs: Detect: dog. Find matching detections by pose similarity. Captions of matches: “Contented dog just laying on the edge of the road in front of a house.”; “Peruvian dog sleeping on city street in the city of Cusco, (Peru)”; “this dog was laying in the middle of the road on a back street in jaco”; “Closeup of my dog sleeping under my desk.” Kuznetsova et al, ACL 12
Retrieving NPs: Detect: fruit. Find matching detections by appearance similarity. Captions of matches: “Tray of glace fruit in the market at Nice, France”; “Fresh fruit in the market”; “A box of oranges was just catching the sun, bringing out detail in the skin.”; “The street market in Santanyi, Mallorca is a must for the oranges and local crafts.”; “mandarin oranges in glass bowl”; “An orange tree in the backyard of the house.” Kuznetsova et al, ACL 12
Retrieving PPstuff: Detect: stuff. Find matching regions by appearance + arrangement similarity. Captions of matches: “Cordoba - lonely elephant under an orange tree...”; “I positioned the chairs around the lemon tree -- it's like a shrine”; “Comfy chair under a tree.”; “Mini Nike soccer ball alone in the grass” Kuznetsova et al, ACL 12
Retrieving PPscene: Extract scene descriptor. Find matching images by global scene similarity. Captions of matches: “I'm about to blow the building across the street over with my massive lung power.”; “Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere”; “Only in Paris will you find a bottle of wine on a table outside a bookstore”; “View from our B&B in this photo” Kuznetsova et al, ACL 12
Candidate phrases: Object NPs: birds, the bird. Actions VPs: are standing, looking for food. Stuff PPs: in water, over water. Scene PPs: in the ocean, near Salt Pond. Example assignment: Position 1: birds; Position 2: over water; Position 3: are standing; Position 4: in the ocean. Kuznetsova et al, ACL 12
Possible Assignments Position 1 Position 2 Position 3 Position 4 birds the bird are standing … … in the ocean Kuznetsova et al, ACL 12
Phrases of the Same Type Position 1 Position 2 Position 3 Position 4 birds the bird are standing … … in the ocean Kuznetsova et al, ACL 12
Singular/Plural Relationships Position 1 Position 2 Position 3 Position 4 birds the bird are standing … … in the ocean Kuznetsova et al, ACL 12
ILP Optimization. Optimize for: • Vision scores o Visual detection/classification scores • Phrase cohesion o n-gram statistics between phrases o Co-occurrence statistics between phrase head words. Subject to: • Linguistic constraints o Allow at most one phrase of each type o Enforce plural/singular agreement between NP and VP • Discourse constraints o Prevent inclusion of repeated phrasing. Kuznetsova et al, ACL 12
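The shape of this optimization can be illustrated on the bird example with a brute-force search in place of an ILP solver. The sketch picks at most one phrase per type in a fixed NP, VP, PPstuff, PPscene order, maximizes vision score plus a cohesion bonus, and enforces NP/VP number agreement. The scores, the number tags, the single cohesion entry, and the exhaustive enumeration are all assumptions for illustration. The paper formulates this as an integer linear program that also chooses phrase order and applies discourse constraints.

```python
import itertools

# Candidate phrases per type: (text, hypothetical vision score, grammatical number).
CANDIDATES = {
    "NP": [("birds", 0.7, "plural"), ("the bird", 0.9, "singular")],
    "VP": [("are standing", 0.8, "plural"), ("looking for food", 0.5, None)],
    "PPstuff": [("in water", 0.5, None), ("over water", 0.4, None)],
    "PPscene": [("in the ocean", 0.6, None), ("near Salt Pond", 0.3, None)],
}

# Hypothetical cohesion bonus between adjacent phrases (pairs not listed score 0).
COHESION = {("are standing", "in water"): 0.3}


def agrees(np, vp):
    """Plural/singular agreement between NP and VP; untagged phrases match anything."""
    if np is None or vp is None:
        return True
    return None in (np[2], vp[2]) or np[2] == vp[2]


def objective(phrases):
    """Vision scores of the chosen phrases plus cohesion between neighbours."""
    chosen = [p for p in phrases if p is not None]
    vision = sum(p[1] for p in chosen)
    cohesion = sum(COHESION.get((a[0], b[0]), 0.0) for a, b in zip(chosen, chosen[1:]))
    return vision + cohesion


def best_description(enforce_agreement=True):
    """Enumerate every choice of at most one phrase per type; keep the best valid one."""
    options = [[None] + CANDIDATES[t] for t in ("NP", "VP", "PPstuff", "PPscene")]
    combos = itertools.product(*options)
    if enforce_agreement:
        combos = (c for c in combos if agrees(c[0], c[1]))
    best = max(combos, key=objective)
    return " ".join(p[0] for p in best if p is not None)


print(best_description())
print(best_description(enforce_agreement=False))
```

With these made-up scores, dropping the agreement constraint lets the higher-scoring but ungrammatical pairing "the bird are standing" win, which is the kind of output the linguistic constraints are meant to rule out.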
Good Examples: “This is a brass viking boat moored on beach in Tobago by the ocean.” “The clock made in Korea.” “This is a sporty little red convertible made for a great day in Key West FL.” “This car was in the 4th parade of the apartment buildings.” Kuznetsova et al, ACL 12
Visual Turing Test Us vs Original Human Written Caption In some cases (16%), ILP generated captions were preferred over human written ones! Kuznetsova et al, ACL 12
Bad Results Not Relevant Grammatically Incorrect Cognitive Absurdity One of the most shirt in the wall of the house. Here you can see a cross by the frog in the sky. Computer Vision Error This is a shoulder bag with a blended rainbow effect. Kuznetsova et al, ACL 12
Questions?