Task: Information Extraction for the Semantic Web Solution: Integration of PDT Tools with GATE and Inductive Logic Programming Jan Dědek Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic ÚFAL Monday Seminar, 9. 1. 2011, MFF UK, Prague
Outline 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Information Extraction Information Extraction and the Semantic Web The Task of Information Extraction Automatically find the information you're looking for. Pick out the most useful bits. Present it in the preferred manner, at the right level of detail. Semantic Web The Web as a universal medium for the exchange of information. Not only for humans but also for software agents. The main problem today: the lack of semantic data on the Web. Extraction of information for the Semantic Web Let's use information extraction to produce semantic data.
Information Extraction Semantic Web Introduction We use Semantic Web ontologies to express the semantics: the RDF and OWL languages, motivated by description logics. Concepts or classes, predicates or relations, individuals or instances. RDF triples: <Subject> <Predicate> <Object>. RDF triples form a directed, labeled graph, the basic data structure of the Semantic Web.
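To make the triple model concrete, here is a minimal Python/rdflib sketch (not part of the original talk); every resource and property name (EX.incident_47443, EX.actionType, ...) is invented for illustration, only the 8 000 CZK figure comes from the example report used later.

# Building a tiny RDF graph: each statement is a <Subject> <Predicate> <Object> triple,
# and together the triples form a directed, labeled graph.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/accidents#")   # invented namespace

g = Graph()
g.add((EX.incident_47443, EX.actionType, Literal("fire")))       # datatype property
g.add((EX.incident_47443, EX.damage, Literal(8000)))             # 8 000 CZK from the example report
g.add((EX.incident_47443, EX.hasParticipant, EX.participant_1))  # object property, points to another node

# Serialize the graph as Turtle to see the triples.
print(g.serialize(format="turtle"))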
Information Extraction Ontology (example) PROTON (PROTo ONtology) http://proton.semanticweb.org/
Semantic Annotation 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Semantic Annotation Semantic Annotation (http://www.ontotext.com/kim/)
Example Tasks 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Example Tasks Example of the web-page with a report of a fire department [screenshot of a www.mvcr.cz news page; site navigation and rubric links omitted] Zubatého 1, 614 00 Brno, telefon 950 630 111, http://www.firebrno.cz Zpravodajství v roce 2006 15.05.2007 V trabantu zemřeli dva lidé K tragické nehodě dnes odpoledne hasiči vyjížděli na silnici z obce Česká do Kuřimi na Brněnsku. Nehoda byla operačnímu středisku HZS ohlášena ve 13.13 hodin a na místě zasahovala jednotka profesionálních hasičů ze stanice v Tišnově. Jednalo se o čelní srážku autobusu Karosa s vozidlem Trabant 601. Podle dostupných informací trabant jedoucí ve z Brna do Kuřimi zřejmě vyjel do protisměru, kde narazil do linkového autobusu dopravní společnosti ze Žďáru nad Sázavou. Ve zdemolovaném trabantu na místě zemřeli dva muži 82letý senior a další muž, jehož totožnost zjišťují policisté. Hasiči udělali na vozidle protipožární opatření a po vyšetření a zadokumentování nehody dopravní policií vrak trabantu zaklesnutý pod autobusem pomocí lana odtrhli. Po odstranění střechy trabantu pak z kabiny vyprostili těla obou mužů. Obě vozidla trabant i autobus, pak postupně odstranili na kraj vozovky a uvolnili tak jeden jízdní pruh. Únik provozních kapalin nebyl zjištěn. Po 16. hodině pomohli vrak trabantu naložit k odtahu a asistovali při odtažení autobusu. Po úklidu vozovky krátce před 16.30 hod. místo nehody předali policistům a ukončili zásah.
Example Tasks Text of an Accident Report and Contained Information Požár byl operačnímu středisku HZS ohlášen dnes ve 2.13 hodin, na místo vyjeli profesionální hasiči ze stanice v Židlochovicích a dobrovolní hasiči z Židlochovic, Žabčic a Přísnotic. Oheň, který zasáhl elektroinstalaci u chladícího boxu, hasiči dostali pod kontrolu ve 2.32 hodin a uhasili tři minuty po třetí hodině. Příčinou vzniku požáru byla technická závada, škodu vyšetřovatel předběžně vyčíslil na osm tisíc korun. The information to be extracted is decorated on the slide: when the fire started and when it was extinguished, the three volunteer (amateur) units involved, the damage of 8 000 CZK, and the report id (id_47443).
Example Tasks Acquisitions Corpus Corporate Acquisition Events, Acquisitions v1.1, from the Dot.kom project's resources: http://nlp.shef.ac.uk/dot.kom/resources.html
1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Basic Idea How to extract the information about the damage caused by the accident? The slide shows the same accident report (id_47443) again, with the relevant last sentence decorated: Příčinou vzniku požáru byla technická závada, škodu vyšetřovatel předběžně vyčíslil na osm tisíc korun. (damage 8 000 CZK) See the linguistic tree of this last sentence on the next slide.
Basic Idea Corresponding linguistic tree [tectogrammatical tree id_47443_p1s2 of the sentence; English glosses on the relevant nodes: reckon, damage, investigating officer, eight, thousand, CZK] , škodu vyšetřovatel předběžně vyčíslil na osm tisíc korun. (the investigating officer preliminarily reckoned the damage to be 8 000 CZK) Basic idea: use tree queries (tree patterns) to extract the information.
Basic Idea Introduction of Our Solution Extraction of semantic information from texts. Exploitation of linguistic tools, mainly from the Prague Dependency Treebank project; related tools: language analyzers (TectoMT), Netgraph, etc. Experiments with the Czech WordNet. A rule-based extraction method: the extraction rules are tree queries, and ILP is used to learn the extraction rules.
Basic Idea Schema of the extraction process 1 Extraction of text: an RSS feed is used to download the pages and a regular expression to extract the text (see the sketch below). 2 Linguistic annotation: a chain of six linguistic tools (see the next slides). 3 Data extraction: exploitation of the linguistic trees using extraction rules. 4 Semantic representation of the data: an ontology is needed; semantic interpretation of the rules; far from finished in the current state.
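A minimal Python sketch of step 1 only, assuming the fire-department RSS feed mentioned later in the talk; the helper names and the article-extracting regular expression are hypothetical, not the code actually used in the system.

# Step 1 of the pipeline: download pages listed in an RSS feed and strip them down to plain text.
import re
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://www.mvcr.cz/rss/regionhzs.html"   # fire-department RSS feed from the talk

def article_links(feed_url=FEED_URL):
    """Return the <link> of every <item> in the RSS feed."""
    with urllib.request.urlopen(feed_url) as response:
        feed = ET.parse(response)
    return [item.findtext("link") for item in feed.iter("item")]

def extract_report_text(html):
    """Cut out the report text with a (site-specific, hypothetical) regular expression."""
    match = re.search(r'<div class="article">(.*?)</div>', html, re.S)
    if not match:
        return ""
    return re.sub(r"<[^>]+>", " ", match.group(1))   # drop any remaining tags

if __name__ == "__main__":
    for link in article_links():
        with urllib.request.urlopen(link) as page:
            print(extract_report_text(page.read().decode("utf-8", errors="replace")))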
Linguistics we Are Using Layers of linguistic annotation in PDT Tectogrammatical layer Analytical layer Morphological layer PDT 2.0 on-line: http://ufal.mff.cuni.cz/pdt2.0/ Sentence: Byl by šel dolesa. He-was would went toforest.
Linguistics we Are Using Tools for machine linguistic annotation 1 Segmentation and tokenization 2 Morphological analysis 3 Morphological tagging 4 McDonald's Maximum Spanning Tree parser (Czech adaptation) 5 Analytical function assignment 6 Tectogrammatical analysis (developed by Václav Klimeš) Available within the TectoMT project: http://ufal.mff.cuni.cz/tectomt/
Linguistics we Are Using Example of an output tectogrammatical tree [tree T-jihomoravsky49640.txt-001-p1s4; every node is labelled with its lemma, functor and semantic part of speech, e.g. zemřít / PRED / v, muž / ACT / n.denot, dva / RSTR / adj.quant.def, zjišťovat / RSTR / v, policista / ACT / n.denot] Sentence: Ve zdemolovaném trabantu na místě zemřeli dva muži - 82letý senior a další muž, jehož totožnost zjišťují policisté. (Two men died on the spot in the demolished Trabant...)
Manually Created Rules 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Manually Created Rules How to extract the information about the two dead people? [the tectogrammatical tree T-jihomoravsky49640.txt-001-p1s4 again, with the nodes to be extracted highlighted: zemřít (PRED, v), muž (ACT, n.denot) and dva (RSTR, adj.quant.def), i.e. "two ... men ... died"]
Manually Created Rules Extraction rules = Netgraph queries Tree patterns constraining the tree shape and the node attributes. Evaluation yields the actual matches of the particular nodes. Named query nodes allow the use of references (see the illustrative sketch below).
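Netgraph has its own query language; the following Python sketch only paraphrases what such a tree pattern does, run over a hand-built fragment of the tree T-jihomoravsky49640.txt-001-p1s4 shown earlier. The node dictionaries, the pattern encoding and the match function are illustrative assumptions, not the real Netgraph format.

# A tiny fragment of the tectogrammatical tree (only the attributes used by the pattern).
tree = {
    "t_lemma": "zemřít", "functor": "PRED", "children": [
        {"t_lemma": "muž", "functor": "ACT", "children": [
            {"t_lemma": "dva", "functor": "RSTR", "children": []},
        ]},
    ],
}

# A pattern node = a name, attribute constraints, and sub-patterns for required children.
pattern = {
    "name": "action_type", "match": {"functor": "PRED"}, "children": [
        {"name": "participant", "match": {"functor": "ACT"}, "children": [
            {"name": "quantity", "match": {"functor": "RSTR"}, "children": []},
        ]},
    ],
}

def match(node, pat, bindings):
    """Return True and record the named matches in `bindings` if `pat` matches `node`."""
    if any(node.get(k) != v for k, v in pat["match"].items()):
        return False
    bindings[pat["name"]] = node["t_lemma"]
    for child_pat in pat["children"]:
        if not any(match(child, child_pat, bindings) for child in node["children"]):
            return False
    return True

bindings = {}
if match(tree, pattern, bindings):
    print(bindings)   # {'action_type': 'zemřít', 'participant': 'muž', 'quantity': 'dva'}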
Manually Created Rules Raw data extraction output

<QueryMatches>
  <Match root_id="T-vysocina63466.txt-001-p1s4" match_string="2:0,7:3,8:4,11:2">
    <Sentence>Při požáru byla jedna osoba lehce zraněna - jednalo se o majitele domu, který si vykloubil rameno.</Sentence>
    <Data>
      <Value variable_name="action_type" attribute_name="t_lemma">zranit</Value>
      <Value variable_name="injury_manner" attribute_name="t_lemma">lehký</Value>
      <Value variable_name="participant" attribute_name="t_lemma">osoba</Value>
      <Value variable_name="quantity" attribute_name="t_lemma">jeden</Value>
    </Data>
  </Match>
  <Match root_id="T-jihomoravsky49640.txt-001-p1s4" match_string="1:0,13:3,14:4">
    <Sentence>Ve zdemolovaném trabantu na místě zemřeli dva muži - 82letý senior a další muž, jehož totožnost zjišťují policisté.</Sentence>
    <Data>
      <Value variable_name="action_type" attribute_name="t_lemma">zemřít</Value>
      <Value variable_name="participant" attribute_name="t_lemma">muž</Value>
      <Value variable_name="quantity" attribute_name="t_lemma">dva</Value>
    </Data>
  </Match>
  <Match root_id="T-jihomoravsky49736.txt-001-p4s3" match_string="1:0,3:3,7:1">
    <Sentence>Čtyřiatřicetiletý řidič nebyl zraněn.</Sentence>
    <Data>
      <Value variable_name="action_type" attribute_name="t_lemma">zranit</Value>
      <Value variable_name="a-negation" attribute_name="m/tag">VpYS---XR-NA---</Value>
      <Value variable_name="participant" attribute_name="t_lemma">řidič</Value>
    </Data>
  </Match>
</QueryMatches>

SELECT action_type.t_lemma, a-negation.m/tag, injury_manner.t_lemma, participant.t_lemma, quantity.t_lemma FROM ***extraction rule***
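A small sketch (not part of the talk) of how the raw QueryMatches XML above can be flattened into the SELECT-style rows; the element and attribute names follow the listing on this slide, while the file name and the function are made up.

# Flatten the QueryMatches output: one row of variable/value pairs per <Match>.
import xml.etree.ElementTree as ET

def rows(query_matches_xml):
    root = ET.fromstring(query_matches_xml)
    for match in root.iter("Match"):
        sentence = (match.findtext("Sentence") or "").strip()
        values = {
            value.get("variable_name"): (value.text or "").strip()
            for value in match.iter("Value")
        }
        yield sentence, values

with open("raw_extraction_output.xml", encoding="utf-8") as f:   # hypothetical file name
    for sentence, values in rows(f.read()):
        print(values.get("action_type"), values.get("participant"),
              values.get("quantity"), "<-", sentence)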
Manually Created Rules Extraction rules: Environment Protection Use Case [tree-query figure with four named query nodes: amount, unit, material, where]
Manually Created Rules Matching Tree Nárazem se utrhl hrdlo palivové nádrže a do potoka postupně vyteklo na 800 litrů nafty. ("Due to the crash the neck of the fuel tank tore off and about 800 litres of oil (diesel) gradually ran out into a stream.") [matching tree jihmor56559.txt-001-p1s3; the matched nodes (1)-(5) are glossed litre, diesel, "into" a water stream]
Manually Created Rules Experimental results, extracted data Raw data extraction output (the slide glosses the extracted lemmas: litr = litre, nafta = diesel, potok = water stream, půda = soil)

<QueryMatches>
  <Match root_id="jihmor56559.txt-001-p1s3" match_string="15:0,16:4,22:1,23:2,27:3">
    <Sentence>Nárazem se utrhl hrdlo palivové nádrže a do potoka postupně vyteklo na 800 litrů nafty.</Sentence>
    <Data>
      <Value variable_name="amount" attribute_name="t_lemma">800</Value>
      <Value variable_name="unit" attribute_name="t_lemma">l</Value>
      <Value variable_name="material" attribute_name="t_lemma">nafta</Value>
      <Value variable_name="where" attribute_name="t_lemma">potok</Value>
    </Data>
  </Match>
  <Match root_id="jihmor68220.txt-001-p1s3" match_string="3:0,12:4,21:1,22:2,27:3">
    <Sentence>Z palivové nádrže vozidla uniklo do půdy v příkopu vedle silnice zhruba 350 litrů nafty, a proto byli o události informováni také pracovníci odboru životního prostředí Městského úřadu ve Vyškově a České inspekce životního prostředí.</Sentence>
    <Data>
      <Value variable_name="amount" attribute_name="t_lemma">350</Value>
      <Value variable_name="unit" attribute_name="t_lemma">l</Value>
      <Value variable_name="material" attribute_name="t_lemma">nafta</Value>
      <Value variable_name="where" attribute_name="t_lemma">půda</Value>
    </Data>
  </Match>
  ...
</QueryMatches>

SELECT amount.t_lemma, unit.t_lemma, material.t_lemma, where.t_lemma FROM ***extraction rule***
Semantic Interpretation 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Semantic Interpretation Semantic interpretation of extraction rules [diagram mapping attributes of the named query nodes (t_lemma, m/tag, t_lemma + numeral translation) onto the ontology: Incident (actionManner String*, negation Boolean, actionType String, hasParticipant Instance* Participant) and Participant (participantType String, participantQuantity Integer)] The interpretation determines how the particular attribute values are used; it gives semantics to the extraction rule and to the extracted data.
Semantic Interpretation Semantic data output incident_49640: negation = false, actionType = death, hasParticipant = participant_49640_1; participant_49640_1: participantType = man, participantQuantity = 2 (nonNegativeInteger). incident_49736: negation = true, actionType = injury, hasParticipant = participant_49736_1; participant_49736_1: participantType = driver. Two instances of each of the two ontology classes.
Semantic Interpretation The experimental ontology Two classes: Incident and Participant. One object property (relation): hasParticipant (Instance*, Participant). Five datatype properties: actionManner (String*, light or heavy injury), negation (Boolean), actionType (String, injury or death), participantType (String; man, woman, driver, etc.) and participantQuantity (Integer).
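A sketch of what the semantic interpretation step produces for incident_49640 from the previous slide, expressed with Python and rdflib against the experimental ontology just described; the namespace URI is invented, and the real system emits the data through its own semantic tools rather than this script.

# Turn one extracted match into individuals of the experimental ontology.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

ONT = Namespace("http://example.org/fire-ontology#")   # invented namespace

g = Graph()
incident = ONT.incident_49640
participant = ONT.participant_49640_1

g.add((incident, RDF.type, ONT.Incident))
g.add((incident, ONT.actionType, Literal("death")))
g.add((incident, ONT.negation, Literal(False)))
g.add((incident, ONT.hasParticipant, participant))       # object property

g.add((participant, RDF.type, ONT.Participant))
g.add((participant, ONT.participantType, Literal("man")))
g.add((participant, ONT.participantQuantity, Literal(2, datatype=XSD.nonNegativeInteger)))

print(g.serialize(format="turtle"))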
Semantic Interpretation Design of extraction rules: an iterative process [cycle diagram: frequency analysis -> key-words -> tree query -> matching trees -> investigation of neighbors -> more complex queries, with coverage tuning leading to more accurate matches] 1 Frequency analysis yields representative key-words. 2 Investigating the matching trees drives the tuning of the tree query. 3 Complexity of the query = complexity of the extracted data.
Semantic Interpretation Corpus of Fire-department articles Fire-department articles published by the Ministry of the Interior of the Czech Republic (http://www.mvcr.cz/rss/regionhzs.html). More than 800 articles from different regions of the Czech Republic were processed, 1.2 MB of textual data; the linguistic tools produced 10 MB of annotations in a run time of 3.5 hours. Extracting information about injured and killed people: 470 matches of the extraction rule, 200 numeric values of quantity (described later).
Learning of Rules 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Learning of Rules Inductive Logic Programming Inductive Logic Programming (ILP) is a machine-learning procedure for multi-relational learning. A heuristic and iterative method; learning is usually slow. It deals naturally with graph and tree structures. It learns from positive and negative examples, here positive and negative tree nodes; the tree nodes have to be labeled from the correspondingly labeled text (not a trivial problem). The learned rules are strict (no weights, probabilities, etc.): they are easier for humans to understand and modify and can be shared among different tools, at the cost of lower performance (precision, recall).
Learning of Rules Context of Our Experiments Semantic Web & semantic data extraction; integration of ILP into our extraction process. [diagram: Web texts -> linguistic trees -> ILP background knowledge; a human annotator supplies learning examples + semantics; ILP learning produces extraction rules; the extraction process yields extracted semantic data] Getting semantics from the Web of today: Czech pages, Czech texts, Czech linguistic tools, the domain of traffic accidents; the semantics is given by a human and generalized & extracted by ILP. Main point: the transformation of trees into a logic representation. The human annotator does not need to be a linguistic expert.
Learning of Rules Logic representation of linguistic trees [the source web page is converted into linguistic trees and then into this logic representation]

tree_root(node0_0).
node(node0_0).
id(node0_0, t_jihomoravsky49640_txt_001_p1s4).
%%%%%%%% node0_1 %%%%%%%%%%%%%%%%%%
node(node0_1).
functor(node0_1, pred).
gram_sempos(node0_1, v).
t_lemma(node0_1, zemrit).
%%%%%%%% node0_2 %%%%%%%%%%%%%%%%%%
node(node0_2).
functor(node0_2, act).
gram_sempos(node0_2, n_pron_def_pers).
t_lemma(node0_2, x_perspron).
%%%%%%%% node0_3 %%%%%%%%%%%%%%%%%%%
node(node0_3).
id(node0_3, ...).
functor(node0_3, loc).
gram_sempos(node0_3, n_denot).
t_lemma(node0_3, trabant).
...
edge(node0_0, node0_1).
edge(node0_1, node0_2).
edge(node0_1, node0_3).
edge(node0_3, node0_4).
edge(node0_4, node0_5).
edge(node0_3, node0_6).
edge(node0_3, node0_7).
edge(node0_3, node0_8).
...
Learning of Rules Root/Subtree Preprocessing/Postprocessing (chunk learning) [the damage tree again, with the root of the extracted chunk and the corresponding sub-tree marked; node glosses: reckon, damage, investigating officer, eight, thousand, CZK] , škodu vyšetřovatel předběžně vyčíslil na osm tisíc korun. (the investigating officer preliminarily reckoned the damage to be eight thousand Crowns (CZK))
Learning of Rules Examples of learned rules (Czech words are translated):

[Rule 1] [Pos cover = 14 Neg cover = 0]
damage_root(A) :-
    lex_rf(B,A), has_sempos(B, 'n.quant.def'),
    tdependency(C,B), tdependency(C,D), has_t_lemma(D, 'investigator').

[Rule 2] [Pos cover = 13 Neg cover = 0]
damage_root(A) :-
    lex_rf(B,A), has_functor(B, 'TOWH'),
    tdependency(C,B), tdependency(C,D), has_t_lemma(D, 'damage').

[Rule 1] [Pos cover = 7 Neg cover = 0]
injuries(A) :-
    lex_rf(B,A), has_functor(B, 'PAT'), has_gender(B, anim),
    tdependency(B,C), has_t_lemma(C, 'injured').

[Rule 8] [Pos cover = 6 Neg cover = 0]
injuries(A) :-
    lex_rf(B,A), has_gender(B, anim),
    tdependency(C,B), has_t_lemma(C, 'injure'), has_negation(C, neg0).
Evaluation Evaluation results (10-fold cross validation; two tasks, damage and injuries; root/subtree preprocessing/postprocessing used for the damage task)

task/method                    matching  missing  excess  overlap  prec.%  recall%  F1.0%
damage/ILP                        14        0       7        6     51.85    70.00   59.57
damage/ILP (lenient measures)                                      74.07   100.00   85.11
dam./ILP-roots                    16        4       2        0     88.89    80.00   84.21
damage/PAUM                       20        0       6        0     76.92   100.00   86.96
injuries/ILP                      15       18      11        0     57.69    45.45   50.85
injuries/PAUM                     25        8      54        0     31.65    75.76   44.64
inj./PAUM-afun                    24        9      38        0     38.71    72.73   50.53
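As a worked example (not in the original slides), the two damage/ILP rows follow from the standard strict and lenient measures of GATE's evaluation tools, reading matching as correct, overlap as partially correct and excess as spurious:

\[
P_{\mathrm{strict}} = \frac{\mathrm{matching}}{\mathrm{matching}+\mathrm{excess}+\mathrm{overlap}} = \frac{14}{14+7+6} = 51.85\,\%,
\qquad
R_{\mathrm{strict}} = \frac{\mathrm{matching}}{\mathrm{matching}+\mathrm{missing}+\mathrm{overlap}} = \frac{14}{14+0+6} = 70.00\,\%
\]
\[
F_{1} = \frac{2PR}{P+R} = \frac{2\cdot 51.85\cdot 70.00}{51.85+70.00} = 59.57\,\%;
\qquad
P_{\mathrm{lenient}} = \frac{14+6}{27} = 74.07\,\%,\quad
R_{\mathrm{lenient}} = \frac{14+6}{20} = 100.00\,\%,\quad
F_{1,\mathrm{lenient}} = 85.11\,\%.
\]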
1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Integration of Linguistic Tools (GATE) GATE GATE: General Architecture for Text Engineering, The University of Sheffield, http://gate.ac.uk/ Implemented: a batch TectoMT language analyzer, transformation of PDT annotations to GATE, and Netgraph used as a tree viewer. Works also for Stanford Dependencies.
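To illustrate what "transformation of PDT annotations to GATE" amounts to (this is not GATE's real Java API), each PDT node becomes a stand-off annotation with character offsets into the document text plus a feature map; all names below are chosen for illustration only.

# A PDT tectogrammatical node represented as a GATE-like stand-off annotation.
text = "Ve zdemolovaném trabantu na místě zemřeli dva muži ..."

def pdt_node_to_gate_annotation(node, start, end):
    """Return a GATE-like annotation record for one tectogrammatical node."""
    return {
        "type": "tToken",                    # annotation type (illustrative name)
        "start": start, "end": end,          # character offsets into the document text
        "features": {                        # PDT attributes become GATE features
            "t_lemma": node["t_lemma"],
            "functor": node["functor"],
            "gram/sempos": node.get("sempos"),
        },
    }

node = {"t_lemma": "zemřít", "functor": "PRED", "sempos": "v"}
start = text.index("zemřeli")
print(pdt_node_to_gate_annotation(node, start, start + len("zemřeli")))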
Integration of Linguistic Tools (GATE) PDT in GATE
Integration of Linguistic Tools (GATE) Netgraph Tree Viewer in GATE (for Stanford Dependencies) Sentence: Users also said it is in the strongest financial position in its 24-year history.
Integration with Semantic Tools 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Integration with Semantic Tools Transformation of PML to RDF A quite simple XSLT transformation. Allows working with PDT annotations inside Semantic Web tools: ontology editors, reasoners, query tools (graph queries), ?visualization and navigation tools? In our case: interpretation of extraction rules by an OWL reasoner.
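A minimal sketch of applying such a PML-to-RDF stylesheet with a generic XSLT processor (Python/lxml here); the file names pml2rdf.xsl and article.t.pml are placeholders, not the actual files of the project.

# Apply an XSLT stylesheet to a PML (t-layer) document and save the RDF result.
from lxml import etree

pml_doc = etree.parse("article.t.pml")      # a PML document -- placeholder name
stylesheet = etree.parse("pml2rdf.xsl")     # the "quite simple" XSLT -- placeholder name
transform = etree.XSLT(stylesheet)

result = transform(pml_doc)                 # serialized RDF/XML result
with open("article.rdf", "w", encoding="utf-8") as out:
    out.write(str(result))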
Integration with Semantic Tools Extraction Rules Interpreted by an OWL Reasoner [architecture diagram with components: Document, TectoMT, GRDDL XSLT, Document Ontology, Information Extraction engine (Reasoner), Learning Corpus (processed by TectoMT), Extraction Rules, Extraction Ontology, Annotated Document Ontology] Tool-independent extraction ontologies.
Integration with Semantic Tools PDT in The Protégé Ontology Editor
Integration with Semantic Tools Examples of extraction rules in the native Prolog format:

[Rule 1] [Pos cover = 23 Neg cover = 6]
mention_root(acquired, A) :-
    lex.rf(B,A), t_lemma(B, 'Inc'),
    tdependency(C,B), tdependency(C,D), formeme(D, 'n:in+X'), tdependency(E,C).

[Rule 11] [Pos cover = 25 Neg cover = 6]
mention_root(acquired, A) :-
    lex.rf(B,A), t_lemma(B, 'Inc'),
    tdependency(C,B), formeme(C, 'n:obj'), tdependency(C,D), functor(D, 'APP').

[Rule 75] [Pos cover = 14 Neg cover = 1]
mention_root(acquired, A) :-
    lex.rf(B,A), t_lemma(B, 'Inc'), functor(B, 'APP'),
    tdependency(C,B), number(C, pl).
Integration with Semantic Tools Examples of extraction rules in the Protégé 4 Rules View's format:

[Rule 1] lex.rf(?B,?A), t_lemma(?B, "Inc"), tdependency(?C,?B), tdependency(?C,?D), formeme(?D, "n:in+X"), tdependency(?C,?E) -> mention_root(?A, "acquired")

[Rule 11] lex.rf(?B,?A), t_lemma(?B, "Inc"), tdependency(?C,?B), formeme(?C, "n:obj"), tdependency(?C,?D), functor(?D, "APP") -> mention_root(?A, "acquired")

[Rule 75] lex.rf(?B,?A), t_lemma(?B, "Inc"), functor(?B, "APP"), tdependency(?C,?B), number(?C, "pl") -> mention_root(?A, "acquired")
Conclusion 1 Problem Information Extraction Semantic Annotation Example Tasks 2 Solution Basic Idea Linguistics we Are Using Manually Created Rules Semantic Interpretation Learning of Rules Evaluation 3 Implementation Details Integration of Linguistic Tools (GATE) Integration with Semantic Tools Conclusion
Conclusion Summary We implemented a system for the extraction of semantic information. It is based on third-party linguistic tools (TectoMT, http://ufal.mff.cuni.cz/tectomt/); the extraction rules are adopted from the Netgraph application (http://quest.ms.mff.cuni.cz/netgraph/); ILP is used for learning the rules; all methods are integrated inside GATE (http://gate.ac.uk/). Main advantages: automated selection of learning features, language independence, rule-based approach.
Conclusion Future work Use some knowledge base (e.g. WordNet). Adapt the method to other languages. Evaluate the method on other datasets. Provide more semantics, e.g. a more sophisticated semantic interpretation of the extracted data.