Dissertation proposal

Dissertation proposal by our compatriot in Germany

Peter Siniakov

Projekt FEx, AG DB,

Fachbereich Mathematik und Informatik

FU Berlin

Takustr. 9 14195 Berlin

Natural language processing (NLP) attracted considerable attention at the end of the 1970s and advanced significantly in many research directions, such as speech recognition, text understanding, and the construction of grammars and conceptual models for natural language. The overly optimistic expectations raised by this early success were not fulfilled, and the main goal, understanding and communicating in natural language, still remains beyond the reach of current research. In text-based NLP the focus is increasingly shifting towards less complex problems whose solutions accomplish very useful tasks in text processing and analysis. One of the most promising efforts in this area is Information Extraction (IE).

Most of the information stored in digital form is hidden in natural language texts. Extracting it and storing it in a formal representation (e.g. as relations in a database) allows efficient querying and easy administration of the extracted data. Moreover, information stored and queried in a canonical way can be processed and interpreted by computers without human interaction; it can serve to establish ontologies, build knowledge bases and support data analysis.

The area of IE comprises techniques, algorithms and methods that perform two important tasks: finding (identifying) the desired, relevant data and storing it in an appropriate form for future use. The notion of fact extraction is often used interchangeably with IE. The goals of fact extraction, however, are typically more specific: fact extraction can be defined as the transformation of facts expressed in natural language into a given, formal, properly defined target structure. This differs from the classical information extraction task, where the emphasis lies mainly on the text processing stage and the target representation is less relevant. Fact extraction can therefore be regarded as a subfield of IE that focuses on more rigidly structured representation forms.

The rule-based approach was the driving force behind the first IE systems and is still the most widely employed, enhanced and improved method in information extraction. Its principle is to supply human syntactic and semantic knowledge sufficient to handle the linguistic diversity of a certain domain. However, the rules have to be specified manually, which requires considerable human effort. The classical rule-based approach may be very appropriate for smaller domains but can hardly be employed in considerably larger application domains.
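
As an illustration only, the following minimal sketch shows what a single hand-written extraction rule might look like. The target relation `Acquisition(buyer, target)`, the regular expression and the example sentences are assumptions made for this sketch and are not taken from any particular IE system.

```python
import re
from typing import NamedTuple, Optional


class Acquisition(NamedTuple):
    """Hypothetical target relation used only for this illustration."""
    buyer: str
    target: str


# A name is approximated as one or more capitalised words.
NAME = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"

# One hand-written rule covering the phrasing "<buyer> acquired <target>".
RULE = re.compile(rf"(?P<buyer>{NAME}) acquired (?P<target>{NAME})")


def extract(sentence: str) -> Optional[Acquisition]:
    """Return the fact matched by the rule, or None if the rule does not fire."""
    match = RULE.search(sentence)
    if match is None:
        return None
    return Acquisition(match.group("buyer"), match.group("target"))


print(extract("Last year Acme Corp acquired Widget Works."))
print(extract("Widget Works was bought by Acme Corp."))
```

The second sentence states the same fact but is not covered by the rule, so nothing is extracted. Every additional phrasing requires another manually written rule, which is the effort that grows with the size of the domain.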

To compensate for the shortcomings of the classical rule-based approach, human effort should be replaced by alternative methods performed by the computer. This can be done either by abandoning the rule concept altogether and using other proven techniques, such as statistical or knowledge-based methods, or by enhancing the rule-based method with a learning component. The main goal of my dissertation will be to develop an algorithm that learns extraction rules and improves them so that they can be applied to any text from the specified domain to perform the actual fact extraction. The algorithm will not depend on any particular domain and should be universally applicable. The amount of human supervision and training effort should be reduced as much as possible. Therefore, rules will be derived from instances of facts found in training texts. For this purpose a context-free language for the specification of linguistic patterns will be defined. The new idea here is that the generation of rules will be guided to a large degree by the target structure. In addition, the rules may have a non-trivial structure, which allows rather complex facts to be extracted. Another novel component will be rule generalization, which will use lexical and syntactic graphs representing the interconnections between words and between syntactic structures in the domain. These graphs will be constructed while processing the training texts. Furthermore, using the graphs without domain restrictions is being considered, since they contain valuable information characterizing the language in general.
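
The general idea of deriving a rule from an annotated fact instance and then generalizing it can be sketched as follows. This is a deliberately simplified illustration and not the algorithm to be developed: regular expressions stand in for the context-free pattern language, a hand-written dictionary stands in for a lexical graph built from training texts, and all names (`induce_rule`, `extract`, `lexical_graph`) are hypothetical.

```python
import re
from typing import Dict, List, Optional


def induce_rule(sentence: str, fact: Dict[str, str],
                lexical_graph: Dict[str, List[str]]) -> "re.Pattern[str]":
    """Derive one extraction rule (a regex) from one annotated training sentence.

    `fact` maps each slot of the target structure to its filler in the sentence.
    """
    slot_of = {filler: slot for slot, filler in fact.items()}
    splitter = "(" + "|".join(re.escape(f) for f in slot_of) + ")"
    used = set()
    parts = []
    for piece in re.split(splitter, sentence):
        slot = slot_of.get(piece)
        if slot is not None and slot not in used:
            # Slot filler: replace it by a named capture group for that slot.
            used.add(slot)
            parts.append(f"(?P<{slot}>.+?)")
            continue
        # Context: keep each word literally, or widen it to the words
        # connected to it in the (assumed) lexical graph.
        for token in re.split(r"(\w+)", piece):
            options = [re.escape(w) for w in [token, *lexical_graph.get(token, [])]]
            parts.append(options[0] if len(options) == 1
                         else "(?:" + "|".join(options) + ")")
    return re.compile("^" + "".join(parts) + "$")


def extract(rules: List["re.Pattern[str]"], text: str) -> Optional[Dict[str, str]]:
    """Apply the learned rules in order and return the first extracted fact."""
    for rule in rules:
        match = rule.match(text)
        if match:
            return match.groupdict()
    return None


# Hand-written stand-in for a lexical graph: word -> related words.
lexical_graph = {"acquired": ["bought", "purchased"]}

rule = induce_rule(
    "Acme Corp acquired Widget Works.",
    {"buyer": "Acme Corp", "target": "Widget Works"},
    lexical_graph,
)
print(extract([rule], "Globex bought Initech."))
```

In this sketch the slots of the target structure determine which parts of the training sentence become variables, which loosely mirrors the idea that rule generation is guided by the target structure. The generalization step lets a rule learned from "acquired" also fire on "bought".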

It is also intended to design and handle target structures that are more expressive and powerful than the relational target structures used in many IE systems. It is envisioned to embed operators of propositional logic by using Horn clauses to express that two primitive facts hold simultaneously or that at least one of them holds.
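
One possible reading of this, offered purely as an assumption, is that a composite fact requiring two primitive facts corresponds to a single Horn clause with two body atoms, while "at least one of" corresponds to two clauses sharing the same head. The sketch below evaluates such clauses by simple forward chaining; the atoms and clause set are invented for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A Horn clause is written as (body atoms, head atom): body_1 AND ... AND body_n -> head.
Clause = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], clauses: List[Clause]) -> Set[str]:
    """Return all atoms derivable from the primitive facts using the clauses."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


clauses: List[Clause] = [
    # Conjunction: both primitive facts must hold.
    (frozenset({"hired(smith)", "position(smith, cfo)"}), "appointment(smith, cfo)"),
    # Disjunction: either primitive fact suffices (two clauses, same head).
    (frozenset({"resigned(jones)"}), "left(jones)"),
    (frozenset({"dismissed(jones)"}), "left(jones)"),
]

extracted = {"hired(smith)", "position(smith, cfo)", "dismissed(jones)"}
result = forward_chain(extracted, clauses)
print(sorted(result))
```

Here the primitive facts would come from the extraction step, and the clauses describe how they combine into the more expressive target structure.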

Computers are increasingly involved in the administration and analysis of information, which goes far beyond pure storage. That is why formalizing and structuring stored data (such as natural language texts) is becoming increasingly important. IE serves not only to structure data but also to identify the relevant data. Human knowledge, provided for example in the form of rules, is very useful for extraction algorithms, but it can hardly be encoded for large domains of natural language. Algorithms that learn to identify facts and obtain linguistic knowledge directly from text under human supervision are among the most promising approaches to fact and information extraction. The design and implementation of a rule learning algorithm for IE will be the subject of my dissertation.

E-mail: rykov2000@mail.ru