{"id":1884,"date":"2019-08-14T02:17:57","date_gmt":"2019-08-14T02:17:57","guid":{"rendered":"http:\/\/kusuaks7\/?p=1489"},"modified":"2024-05-02T09:49:54","modified_gmt":"2024-05-02T09:49:54","slug":"measuring-without-labels-a-different-approach-to-information-extraction","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/measuring-without-labels-a-different-approach-to-information-extraction\/","title":{"rendered":"Measuring without labels: a different approach to information extraction"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"1884\" class=\"elementor elementor-1884\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-3214b1b4 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"3214b1b4\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-75603bab\" data-id=\"75603bab\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-205c9ce0 elementor-widget elementor-widget-text-editor\" data-id=\"205c9ce0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tInformation extraction (IE) is a broad area in both the natural language processing (NLP) and the Web communities. The main goal of IE is to extract useful information from raw documents and webpages. 
For example, given a product webpage, one might want to extract attributes like the name of the product, its date of production, price, and seller.\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-79e2049 elementor-widget elementor-widget-image\" data-id=\"79e2049\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"http:\/\/blogs.biomedcentral.com\/on-society\/wp-content\/uploads\/sites\/13\/2019\/07\/background-canvas-code-249798-620x342.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0018e7f elementor-widget elementor-widget-text-editor\" data-id=\"0018e7f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA few decades ago, such extractors were built painstakingly using rules. 
In the modern artificial intelligence (AI) community, IE is done using machine learning.\u00a0<em>Supervised machine learning<\/em>\u00a0methods take a\u00a0<em>training set<\/em>\u00a0of webpages, with gold standard extractions, and learn an IE function based on statistical models like conditional random fields and even deep neural nets.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7924d33 elementor-widget elementor-widget-text-editor\" data-id=\"7924d33\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTraditional IE, the setting\u00a0<a href=\"https:\/\/appliednetsci.springeropen.com\/articles\/10.1007\/s41109-019-0154-z\" target=\"_blank\" rel=\"noopener noreferrer\">assumed in our article<\/a>, relies on a particular\u00a0<em>schema<\/em>\u00a0according to which information must be extracted and typed. A schema for e-commerce might include the attributes mentioned earlier, such as price and date. 
Domain-specific applications, such as human trafficking investigations, generally require the schema to be specific and fine-grained, supporting attributes of interest to investigators, including phone number, address, and physical features such as hair color and eye color.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6bf70bc elementor-widget elementor-widget-image\" data-id=\"6bf70bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"http:\/\/blogs.biomedcentral.com\/on-society\/wp-content\/uploads\/sites\/13\/2019\/07\/hiding-1209131_1920-1024x678.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-453f95b elementor-widget elementor-widget-text-editor\" data-id=\"453f95b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 10px;\">The analysis of information related to human trafficking has proven difficult through the lens of natural language processing.<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-size: 10px;\">Image by Free-Photos from Pixabay<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-269a9a6 elementor-widget elementor-widget-text-editor\" data-id=\"269a9a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAI is not a solved problem, and IE has proven difficult for the NLP community in non-traditional domains like human trafficking. 
Unlike in ordinary domains, such as news and e-commerce, training and evaluation datasets for human trafficking IE are not available, and collecting such labeled datasets through traditional methods like crowdsourcing is not straightforward, given the sensitivity of the domain.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-77b9042 elementor-widget elementor-widget-text-editor\" data-id=\"77b9042\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhile minimally supervised or even unsupervised IE methods (which require only a few labels, or none at all) can be used, especially with a good sprinkling of domain knowledge, the problem of evaluation remains. Simply put: how can one know whether the IE is good enough without a large, labeled dataset?\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-782df45 elementor-widget elementor-widget-text-editor\" data-id=\"782df45\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nThe conventional answer is that this is not possible. 
But we argue that, given a sufficiently large set of documents over which an IE program has been executed, we can use the\u00a0<em>dependencies\u00a0<\/em>between extractions to reason about IE performance.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-09301a0 elementor-widget elementor-widget-text-editor\" data-id=\"09301a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhat do we mean by \u2018a dependency?\u2019\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e4a0e5 elementor-widget elementor-widget-image\" data-id=\"5e4a0e5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"http:\/\/blogs.biomedcentral.com\/on-society\/wp-content\/uploads\/sites\/13\/2019\/07\/mayank.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9922eb9 elementor-widget elementor-widget-text-editor\" data-id=\"9922eb9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\"><span style=\"font-size: 10px;\">An example of an attribute extraction network, assuming the attribute Name. Vertices are documents.<\/span><\/p>\nConsider the network illustrated in the figure above. In this kind of network, called an attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, a name). 
For example, documents D1 and D2 are connected by an edge because they share the extraction \u2018Mayank.\u2019 Note that constructing the AEN only requires the output of an IE system, not a gold standard set of labels.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6a9a8d5 elementor-widget elementor-widget-text-editor\" data-id=\"6a9a8d5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOur primary hypothesis in the article was that, if we measure network-theoretic properties (like the degree distribution, connectivity, etc.) of the AEN, correlations will emerge between these properties and IE performance metrics like precision and recall, which ordinarily require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6d62653 elementor-widget elementor-widget-text-editor\" data-id=\"6d62653\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFor example, \u2018Charlotte\u2019 is a more ambiguous location term than \u2018London,\u2019 since Charlotte is both a common given name and the name of a (fairly well known) city in the US state of North Carolina, whereas London is predominantly used in a location context (though it may occasionally emerge as someone\u2019s name). The hypothesis is that, if an IE system mis-extracts Charlotte from a document once, it will (all else being equal) mis-extract it from other documents as well. In other words, IE mistakes are not random incidents. 
By mapping IE outputs as a network, we are able to quantify the non-random nature of these mistakes and estimate the performance of the IE system on the dataset as a whole.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7199f6e elementor-widget elementor-widget-text-editor\" data-id=\"7199f6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tBecause we had access to such a gold standard set, painstakingly constructed by social scientists over millions of sex advertisements scraped from the Internet, we were able to study these correlations across three important attributes (names, phone numbers, and locations) and determine that such correlations do exist, especially for the precision metric.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1de0280 elementor-widget elementor-widget-text-editor\" data-id=\"1de0280\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn our experiments, we construct a separate AEN for each extraction class of interest: one for phone number extractions, one for name extractions (as in the figure), and so on. In the future, an interesting line of work that we are looking into is to construct a multi-network where all the extractions are modeled jointly in a single multi-class network. 
Our hypothesis is that our results and predictions can be improved even further by considering such joint models.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-de2f7aa elementor-widget elementor-widget-text-editor\" data-id=\"de2f7aa\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe takeaway of this work is that, for AI systems whose outputs exhibit dependencies, ours may be an exciting methodology for studying the performance of these systems. Historically, network science has been used primarily for the study of \u2018non-abstract\u2019 interactions that are typically amenable to observation, such as friendships, co-citations, and protein-protein interactions. In contrast, the AEN is a highly abstract network of an IE system\u2019s outputs, albeit still based on actual observations. However, its consequences are very real: it can be used to compare and (approximately) evaluate systems without an actual, painstakingly acquired ground truth.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Information extraction is a major problem in the fields of natural language processing and web mining, in particular when it comes to evaluating domains where language cannot be taken at face value. 
In the modern artificial intelligence (AI) community, information extraction is done using machine learning.&nbsp;Supervised machine learning&nbsp;methods take a training set&nbsp;of webpages, with gold standard extractions, and learn an IE function based on statistical models like conditional random fields and even deep neural nets.<\/p>\n","protected":false},"author":619,"featured_media":3622,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[187],"tags":[95],"ppma_author":[3331],"class_list":["post-1884","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-big-data-amp-technology"],"authors":[{"term_id":3331,"user_id":619,"is_guest":0,"slug":"mayank-kejriwal","display_name":"Mayank Kejriwal","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Kejriwal","first_name":"Mayank","job_title":"","description":"Dr. Mayank Kejriwal is a research lead at the University of Southern California&rsquo;s Information Sciences Institute, and a research assistant professor at USC&rsquo;s Department of Industrial and Systems Engineering. &nbsp;He has delivered talks, tutorials, demonstrations and workshops at over 20 international academic and industrial venues, published more than 30 peer-reviewed articles and papers, and is currently co-authoring two books on knowledge graphs. 
In 2018, he was awarded a Key Scientific Challenge Award by the Allen Institute for Artificial Intelligence and was designated a Forbes under 30 Scholar.&nbsp;"}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1884","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/619"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1884"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1884\/revisions"}],"predecessor-version":[{"id":36830,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1884\/revisions\/36830"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3622"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1884"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1884"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1884"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}