Visually Extracting Data Records from the Deep Web. Neil Anderson and Jun Hong, Queen’s University Belfast, UK

Upload: neildaaanderson

Post on 25-Jun-2015


TRANSCRIPT

Page 1: Neil andersonjunhong

Visually Extracting Data Records from the Deep Web

Neil Anderson and Jun Hong
Queen’s University Belfast, UK

Page 2: Neil andersonjunhong
Page 3: Neil andersonjunhong
Page 4: Neil andersonjunhong

Data Record Extraction

Given a query result page containing a set of data records, our goal is to group the data items and labels of each data record together.

Page 5: Neil andersonjunhong

Page 6: Neil andersonjunhong

Previous Approaches

• Common theme is to identify repeated patterns
• Source code and regular expressions
  – JavaScript makes this tricky
• Supervised learning with annotated pages
  – Wrapper induction
• Tag tree representation (DOM)
  – Hierarchical representation of the page, designed for the browser, not for humans
  – Doesn’t mirror the displayed structure; modern complex web pages make this difficult

Page 7: Neil andersonjunhong
Page 8: Neil andersonjunhong

Layout  Engine  

Page 9: Neil andersonjunhong
Page 10: Neil andersonjunhong
Page 11: Neil andersonjunhong

What  now?  

Page 12: Neil andersonjunhong

Our Visual Approach

• Mimic human intuition
• Make use of the common sources of evidence on displayed pages that humans use, including
  – Structural regularity
  – Visual and content similarity between data records

 

Page 13: Neil andersonjunhong
Page 14: Neil andersonjunhong
Page 15: Neil andersonjunhong
Page 16: Neil andersonjunhong

Previous Approaches Need to Identify Data Rich Section

Pitfalls: How to identify the Data Rich Section

• DRS does not contain all the records
• DRS contains noise as well as records

         

Page 17: Neil andersonjunhong
Page 18: Neil andersonjunhong

Our Approach

• We find records, not the Data Rich Section
• Extract data records individually on displayed query result pages, while excluding noise items
• Records in a grid or a column
• Use clustering algorithms and a set of similarity measures to:
  – Identify records
  – Exclude noise

Page 19: Neil andersonjunhong

Our Approach

Pipeline (diagram), in order:
• Web Page
• Renderer (WebKit)
• Visual Block Modeller (JavaScript)
• Seed Block Selector (JavaScript)
• Data Record Block Selector (jQuery)
• Record Boundary Drawer (jQuery)

Page 20: Neil andersonjunhong

Page 21: Neil andersonjunhong

Green  and  blue  blocks  

Page 22: Neil andersonjunhong
Page 23: Neil andersonjunhong
Page 24: Neil andersonjunhong

Page 25: Neil andersonjunhong
Page 26: Neil andersonjunhong

Page 27: Neil andersonjunhong
Page 28: Neil andersonjunhong

Page 29: Neil andersonjunhong
Page 30: Neil andersonjunhong

Green  and  blue  blocks  

Page 31: Neil andersonjunhong
Page 32: Neil andersonjunhong

Page 33: Neil andersonjunhong
Page 34: Neil andersonjunhong

Page 35: Neil andersonjunhong

Selecting Other Candidate Containers

Filter the set of all container blocks on the page (blue blocks): discard blocks that don’t match the width of any candidate container block (orange blocks), then cluster the remaining blocks by width.

Why width? Web pages are designed for vertical, not horizontal, scrolling.
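The width filter and clustering step described on this slide could be sketched as follows. This is a minimal illustration in Python rather than the JavaScript/jQuery used in the pipeline, and it is not the authors' code: the `Block` class, the `cluster_by_width` function, and the pixel `tolerance` parameter are all hypothetical names and assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered block with position and size in pixels (hypothetical model)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def cluster_by_width(containers, candidates, tolerance=1.0):
    """Discard containers matching no candidate width; group the rest by width.

    `tolerance` (in pixels) is an assumed parameter, not taken from the slides.
    """
    clusters = defaultdict(list)
    for block in containers:
        for cand in candidates:
            if abs(block.width - cand.width) <= tolerance:
                clusters[cand.width].append(block)
                break  # each container joins at most one cluster
    return dict(clusters)


# Made-up example: one candidate (orange) and four container blocks (blue).
candidates = [Block("seed", 10, 100, 300, 120)]
containers = [
    Block("header", 0, 0, 960, 80),
    Block("r1", 10, 100, 300, 120),
    Block("r2", 320, 100, 300, 118),
    Block("sidebar", 800, 100, 150, 600),
]
clusters = cluster_by_width(containers, candidates)
```

In this made-up example the header and sidebar are discarded because their widths match no candidate, while `r1` and `r2` share a width with the candidate and end up in one cluster even though their heights differ.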

     

Page 36: Neil andersonjunhong

Page 37: Neil andersonjunhong

Selecting Record Containers

Block content similarity measure

• Block A – Candidate record block (orange)
• Block B – Container block (blue) with the same width as A

The cluster with the maximum number of similar blocks is the winner!
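The selection rule above might look like the sketch below. The slide's actual block content similarity measure is not reproduced in this transcript, so Jaccard similarity over content tokens is swapped in purely as a stand-in. The function names, the token representation of a block, and the `threshold` of 0.5 are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections (stand-in measure)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def pick_record_cluster(clusters, threshold=0.5):
    """Return the names of the similar blocks in the winning cluster.

    `clusters` is a list of (candidate_tokens, blocks) pairs, where `blocks`
    is a list of (block_name, block_tokens) with the same width as the
    candidate. The winner is the cluster with the most blocks whose
    similarity to its candidate reaches `threshold` (an assumed value).
    """
    best = []
    for candidate_tokens, blocks in clusters:
        similar = [
            name
            for name, tokens in blocks
            if jaccard(candidate_tokens, tokens) >= threshold
        ]
        if len(similar) > len(best):
            best = similar
    return best


# Made-up example: a cluster of product-like blocks and a navigation cluster.
clusters = [
    (
        ["image", "title", "price"],
        [
            ("r1", ["image", "title", "price"]),
            ("r2", ["image", "title", "price", "rating"]),
            ("ad", ["banner"]),
        ],
    ),
    (["link"], [("nav", ["link"])]),
]
winner = pick_record_cluster(clusters)
```

Here the advert shares a width with the records but is excluded as noise because its content does not resemble the candidate block.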

Page 38: Neil andersonjunhong

Page 39: Neil andersonjunhong

Page 40: Neil andersonjunhong

Page 41: Neil andersonjunhong

Page 42: Neil andersonjunhong

Page 43: Neil andersonjunhong
Page 44: Neil andersonjunhong
Page 45: Neil andersonjunhong
Page 46: Neil andersonjunhong
Page 47: Neil andersonjunhong

Visual  Block  Model  

Page 48: Neil andersonjunhong
Page 49: Neil andersonjunhong

Visual Block Model - Clean

Page 50: Neil andersonjunhong
Page 51: Neil andersonjunhong
Page 52: Neil andersonjunhong

Conclusions: Main Contributions

• Visual approach: directly access a rendering engine to get positional and visual features, rather than relying on source code or tag trees
• No need to identify the data rich section
• Use observations on visual and content similarity, and structural regularity, to group data items into records

Page 53: Neil andersonjunhong

Future Work

• Use a domain schema from schema.org, or a domain ontology, to annotate data records
• Use a domain schema or ontology to annotate query forms too
• Solve label incompleteness and inconsistency issues
• Similarity threshold
  – Set by machine learning
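One simple way a similarity threshold might be set from data is sketched below. The slides do not say which learning method is intended, so this is only an assumed illustration: it picks the cut-off that maximises accuracy on a small set of labelled block pairs. The function name and the example scores are hypothetical.

```python
def best_threshold(scored_pairs):
    """Pick the similarity cut-off with the highest accuracy on labelled pairs.

    `scored_pairs` is a list of (similarity, is_match) tuples, where
    `is_match` says whether the two blocks really are records of the same
    kind. Returns (threshold, accuracy). Ties keep the lowest threshold.
    """
    if not scored_pairs:
        raise ValueError("need at least one labelled pair")
    best_t, best_acc = 0.0, -1.0
    for t in sorted({score for score, _ in scored_pairs}):
        # A pair is predicted to match when its score reaches the threshold.
        correct = sum((score >= t) == label for score, label in scored_pairs)
        accuracy = correct / len(scored_pairs)
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc


# Made-up labelled similarity scores.
pairs = [(0.9, True), (0.8, True), (0.6, True), (0.4, False), (0.2, False)]
threshold, accuracy = best_threshold(pairs)
```

A learned cut-off of this kind would replace a hand-tuned constant in the similarity checks, at the cost of needing labelled example pages.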

Page 54: Neil andersonjunhong