BAN432 Applied Textual Data Analysis for Business and Finance
Recent years have seen an increase in textual data that is accessible in electronic form. Examples are annual reports, company releases, and newspaper articles, as well as user-generated content on social media such as blogs, forums and tweets. All of these text data are generated by humans and may thus contain information about the authors' opinions and preferences. Textual data analysis is the process of deriving high-quality information from text, which can subsequently be used for economic decision making. Employing computers to process textual data allows for (1) analysing digital information more quickly than a human reader can, (2) detecting high-dimensional patterns, and (3) conducting structural analyses on textual data.
While this course primarily looks into applications for finance, textual data analysis can be applied to other fields as well.
In this course the following topics are covered:
- Introduction to R: The course is based on the free software environment R. Students will get an introduction to the basic concepts in R that are relevant for text mining. Students will learn to use web sources such as R vignettes and stackoverflow.com effectively.
- Data collection: Students will learn how to write R scripts to collect data from different sources. In particular, this course discusses Application Programming Interfaces (APIs), for example those provided by Google, Twitter, or The New York Times. The course introduces the students to techniques for retrieving company filings, such as annual reports and merger prospectuses, through the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR).
- Preprocessing and structuring of data: Most of the retrieved data contains unwanted markup, such as HTML tags, or non-words that have to be removed before compiling a corpus (a collection of texts). The students will learn how to use regular expressions in R to filter the data and prepare it for analysis.
- Data analysis and application: During the course the students will learn to analyse text data in various ways, and apply it to a finance context. In particular the course will introduce the students to the following concepts:
  - Sentiment Analysis evaluates the tone (positive/negative) of a document. This can be used to judge the tone of a given firm announcement, which in turn could be used as an input to an automated trading algorithm.
  - Social Media Analysis is used to extract structural information from data provided by the actual consumers (e.g. tweets). For example, this can be used to measure customers' attitudes towards new products.
  - Document Clustering evaluates the similarity of texts. This can be used to cluster companies in order to study economic linkages such as product competition or supplier relationships.
  - Natural Language Processing (NLP) and Corpus Linguistics: termness, part-of-speech tagging, collocations, representativeness, etc.
- Case Study: The course will include a guest lecture where the students will be introduced to relevant corpora. The lecturer will discuss the practical challenges of constructing a large-scale corpus and making it accessible. Central concepts in computational/corpus linguistics and phraseology, including n-grams, keyness, collocations, statistical measures of association, dispersion, etc. will be introduced. This case study will explore the characteristics of business news discourse contrasted with other types of discourse.
- Critical understanding: While computers work more efficiently than humans, they are unable to understand text the way a human reader does. The students will (among other things) be introduced to the following limitations:
  - Context. In everyday language the words "liability" and "debt" have a negative undertone, while in a business context they only refer to a certain form of capital.
  - Reading between the lines. A computer is unable to detect sarcasm, irony, metaphors, etc.
  - Classification of words. "Up" is often classified as positive, which would not be the case with reference to unemployment.
  - Negation. The sentiment of a given word is inverted by including the word "not", which might be difficult for a computer to understand.
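As a rough illustration of the preprocessing, sentiment and negation points above, the following minimal sketch strips HTML tags with a regular expression and computes a dictionary-based tone score, with and without a simple negation rule. It is written in Python for brevity, whereas the course itself works in R; the word lists, the function names (`strip_markup`, `tokenize`, `tone`) and the sample sentence are assumptions made up for this example and are not course material.

```python
import re

# Toy word lists invented for this illustration (not a published dictionary).
POSITIVE = {"strong", "growth", "improved"}
NEGATIVE = {"loss", "weak", "decline"}
NEGATORS = {"not", "no", "never"}


def strip_markup(raw):
    """Replace HTML tags with spaces, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def tokenize(text):
    """Lower-case the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tone(tokens, handle_negation=True):
    """Count positive minus negative words.

    With handle_negation=True, a word directly preceded by a negator
    has its contribution flipped (a deliberately simple heuristic).
    """
    score = 0
    for i, tok in enumerate(tokens):
        value = int(tok in POSITIVE) - int(tok in NEGATIVE)
        if handle_negation and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


raw = "<p>Revenue growth was <b>not</b> strong this quarter.</p>"
clean = strip_markup(raw)
tokens = tokenize(clean)

naive = tone(tokens, handle_negation=False)    # ignores "not"
adjusted = tone(tokens, handle_negation=True)  # flips the word after "not"

print(clean)
print("naive:", naive, "adjusted:", adjusted)
```

Without the negation rule the sample sentence scores 2 (two positive words); flipping the word that directly follows a negator brings the score to 0. A one-word look-back like this is only a heuristic and misses negations that sit further away from the word they modify.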
After completing the course successfully, the students will have acquired the knowledge, practical skills and competence as specified below:
KNOWLEDGE - The candidate...
- will be familiar with packages in the R software environment that are relevant for text mining
- will have knowledge about important concepts in natural language processing (NLP) and corpus linguistics
- will have a critical and reflective understanding of the text mining approaches presented in the course
- will have an academic basis that prepares for future studies within the field of business and finance using textual data analysis
SKILLS - The candidate will be able to...
- collect textual data from digital sources relevant for finance and business
- preprocess textual data to make it accessible for the actual analysis
- analyse the retrieved data using natural language processing (NLP)
- visualise results
- use the results to guide economic decision making
COMPETENCE - The candidate...
- has the knowledge necessary to carry out textual data analysis in order to guide economic decisions
This course focuses on practice. All theoretical concepts are applied by implementing the actual code in R.
This course is taught using a combination of regular lectures and examples in R. The lectures are aimed at providing the core information concerning the principles of analysing textual data and their application in finance and business. All subjects discussed will be implemented using R. Students have to bring their laptops to class and work with the examples individually or in groups.
Although not required, a basic knowledge of R is recommended. Students without any prior knowledge of R are expected to keep up with the course's demands independently.
While not a formal prerequisite, students will benefit from having attended BUS455 Applied programming and data analysis for business. The focus of this course is different from FIE446 Financial Engineering, but there will be synergies due to the use of the same programming language.
Credit reduction due to overlap
Course identical to FIE452.
Requirements for course approval
Three written homework assignments have to be handed in.
The final grade is based on a project that is handed in as a written report (50%) and a 15-minute oral presentation of the findings in the report (50%). The work on the final project and the presentation is group-based. All group members have to take an active part in writing the report and in the oral presentation.
The students will have one week to complete the final project.
Grading: The grading will be based on the correctness of the results, the clarity and persuasiveness of each assignment, as well as the students' ability to learn and implement new concepts. Students are expected to relate their final project to the relevant literature. The final project and the homework assignments have to be submitted in English. Students who want to retake the course have to retake both the final project and the homework assignments. All elements have to be submitted in the same semester.
This course is a continuation of FIE452 and the total number of attempts applies to the course (not the course code).
Laptop with a working installation of R. We will work with R during the course and students are required to bring their laptops to class.
We recommend using a user interface such as RStudio (or a comparable alternative).
Documentation and vignettes for the used R-packages
Introductions to R:
- Lam, Longhow: An Introduction to R., available on cran.r-project.org
- Venables, W.N./Smith, D.M., 2016, An Introduction to R., available on cran.r-project.org
- Paradis, E., 2011, R for Beginners, available on cran.r-project.org
- García, D., 2013, Sentiment During Recessions. The Journal of Finance 68(3). 1267-1300.
- García, D./Norli, Ø., 2012, Crawling EDGAR. The Spanish Review of Financial Economics 10. 1-10.
- García, D./Norli, Ø., 2012, Geographic Dispersion and Stock Returns. Journal of Financial Economics 106. 547-565.
- Hoberg, G./Phillips, G., 2016, Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124(5). 1423-1465.
- Loughran, T./McDonald, B., 2016, Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research 54(4). 1187-1230.
- Loughran, T./McDonald, B., 2014, Measuring Readability in Financial Disclosure. The Journal of Finance 69(4). 1643-1671.
- Loughran, T./McDonald, B., 2011, When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance 66(1). 35-65.
- Li, Feng, 2008, Annual Report Readability, Current Earnings, and Earnings Persistence. Journal of Accounting and Economics 45(2). 221-247.
- Tetlock, Paul C., 2007, Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of Finance 62(3). 1139-1168.
- ECTS Credits
- Teaching language
Autumn. Offered Autumn 2018
Assistant Professor Christian Langerfeld, Department of Professional and Intercultural Communication, NHH
Assistant Professor Maximilian Rohrer, Department of Finance, NHH