
Visual Question Answering
Aaron Honculada, Aisha Urooj, Dr. Mubarak Shah, Dr. Niels Lobo

TVQA Dataset
• 460 hours of video
• 152,545 question and answer pairs
• 21,793 clips (60-90 sec)
• Multimodal
• Compositionality
• Video-QA
• Associated natural language (subtitles)

Questions
• Main question part
• Grounding part
• Temporal localization
• Each clip has 7 questions
• Each question has 5 multiple-choice answers
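The two-part question structure above can be illustrated with a small sketch. It assumes the grounding part begins at the first "when", "before" or "after" in the question; that splitting rule, the `QAItem` container, and the example question are hypothetical illustrations, not taken from the slides.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: the grounding part starts at the first temporal cue word.
_GROUNDING_CUE = re.compile(r"\b(when|before|after)\b", re.IGNORECASE)


@dataclass
class QAItem:
    """One multiple-choice question (hypothetical container)."""
    question: str
    answers: List[str]   # five candidate answers
    correct_idx: int     # index of the correct candidate


def split_question(question: str) -> Tuple[str, Optional[str]]:
    """Return (main part, grounding part); grounding is None if no cue is found."""
    match = _GROUNDING_CUE.search(question)
    if match is None:
        return question.strip(), None
    return question[:match.start()].strip(), question[match.start():].strip()


item = QAItem(
    question="What is the man holding when he opens the door?",
    answers=["A cup", "A phone", "A book", "A bag", "Nothing"],
    correct_idx=0,
)
main_part, grounding_part = split_question(item.question)
print(main_part)       # main question part
print(grounding_part)  # grounding part, used for temporal localization
```

The grounding part is what ties the question to a moment in the clip, so a split like this is one way to inspect which half of a question a model gets wrong.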

TVQA

TVQA
• Subtitles
• Visual Concepts
• Object detection
• Concatenate
• Remove duplicates
• Video Features
• ResNet
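The visual-concept steps listed above (detect objects, concatenate, remove duplicates) can be sketched as follows. This is a minimal illustration that assumes the detector output is already available as one list of label strings per frame; the function name and the sample labels are hypothetical.

```python
from typing import Iterable, List


def merge_visual_concepts(per_frame_labels: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate detected labels across frames, keeping the first occurrence of each."""
    seen = set()
    merged: List[str] = []
    for labels in per_frame_labels:
        for label in labels:
            key = label.strip().lower()  # normalize so "Door" and "door" count once
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical detector output for three frames of one clip.
frames = [["man", "white shirt"], ["Man", "door"], ["door", "cup"]]
concepts = merge_visual_concepts(frames)
print(concepts)
```

Keeping first-occurrence order means the merged list roughly follows the order in which objects appear in the clip.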

Model Used

Baseline Models
• LSTM
• BiLSTM
• Baseline CNN+LSTM
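Because every question has five candidate answers, any of these baselines can be scored the same way. The sketch below assumes each model emits one score per candidate and that the highest score is taken as the prediction; the helper names and the sample scores are hypothetical.

```python
from typing import List, Sequence


def predict(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate answer."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores: List[Sequence[float]], labels: List[int]) -> float:
    """Percentage of questions whose top-scoring candidate is the correct one."""
    if len(all_scores) != len(labels):
        raise ValueError("need exactly one label per question")
    if not labels:
        return 0.0
    correct = sum(predict(s) == y for s, y in zip(all_scores, labels))
    return 100.0 * correct / len(labels)


# Two hypothetical questions with five candidate scores each.
scores = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # predicts candidate 1
    [0.30, 0.20, 0.20, 0.20, 0.10],  # predicts candidate 0
]
labels = [1, 2]
print(accuracy(scores, labels))

# With five candidates, uniform random guessing gives 100 / 5 = 20% accuracy.
chance_level = 100.0 / 5
```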

Results: Accuracy (%)

Model Used     TVQA + S   TVQA + V   TVQA + IMG   TVQA + V + IMG
Reported       65.15%     45.03%     43.78%       N/A
Replication    65.74%     45.25%     44.42%       45.52%

Baseline       Q          S+Q        V+Q          (FC) V+Q     S+V+Q
LSTM           42.74%     42.71%     42.61%       42.85%       42.39%
BiLSTM         42.48%     42.67%     -            42.86%       42.84%
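To make the reproduction gap explicit, the reported and replicated accuracies can be compared directly. The sketch below uses only the numbers from the results above; the dictionary layout and function name are illustrative choices.

```python
from typing import Dict

# Accuracy (%) as listed in the results; TVQA + V + IMG has no reported value.
reported: Dict[str, float] = {
    "TVQA + S": 65.15,
    "TVQA + V": 45.03,
    "TVQA + IMG": 43.78,
}
replicated: Dict[str, float] = {
    "TVQA + S": 65.74,
    "TVQA + V": 45.25,
    "TVQA + IMG": 44.42,
    "TVQA + V + IMG": 45.52,
}


def replication_gaps(rep: Dict[str, float], ref: Dict[str, float]) -> Dict[str, float]:
    """Replicated minus reported accuracy, in percentage points, where both exist."""
    return {name: round(rep[name] - value, 2) for name, value in ref.items() if name in rep}


gaps = replication_gaps(replicated, reported)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
```

All three replicated configurations land within one percentage point of the reported numbers.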


Summary and Next Steps
• Reproduced results
• Baseline results
• Look into network mistakes and address them
• Main goal: boost performance by using visual cues effectively