BAN432 Applied Textual Data Analysis for Business and Finance
Recent years have seen an increase in textual data that is accessible in electronic form. Examples are annual reports, company releases, and newspaper articles, as well as user-generated content on social media such as blogs, forums and tweets. All of this textual data is generated by humans and may thus contain information about the author's opinions and preferences. Textual data analysis is the process of deriving high-quality information from text, which can subsequently be used for economic decision making. Employing computers to process textual data allows for (1) analysing digital information faster than a human reader can, (2) detecting high-dimensional patterns, and (3) conducting structural analyses on textual data.
While this course primarily looks into applications for finance, textual data analysis can be applied to other fields as well.
In this course the following topics are covered:
- Introduction to R: The course is based on the free software environment R. Students will get an introduction to the basic concepts in R that are relevant for text mining. Students will learn to use web resources such as R vignettes and stackoverflow.com effectively.
- Data collection: Students will learn how to write R scripts to collect data from different sources. We will use different Application Programming Interfaces (APIs) and the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) for firm data. For unstructured data, we will program simple web scrapers.
- Preprocessing and structuring of data: Most of the retrieved data contains unwanted markup such as HTML tags, or non-words, that have to be removed before the analysis. Students will learn how to use regular expressions in R to filter the data and prepare it for analysis.
- Data analysis and application: During the course, students will learn to analyse textual data in various ways and apply the results to a finance/business context. In particular, the course will introduce students to the following concepts:
  - Sentiment Analysis evaluates the tone (positive/negative) of a document. This can be used to judge the tone of a given firm announcement, which in turn could feed into an automated trading algorithm.
  - Social Media Analysis is used to extract structural information from data provided by the actual consumers (e.g. tweets). For example, this can be used to measure customers' attitudes towards new products.
  - Document Clustering evaluates the similarity of texts. This can be used to cluster companies in order to study economic linkages such as product competition or supplier relationships.
  - Machine Learning applications for textual data. We will apply algorithms to detect which textual features (keywords etc.) can predict certain outcomes, such as product ratings based on written customer reviews.
  - Other methods from Natural Language Processing (NLP) and Corpus Linguistics: termness, part-of-speech tagging, collocations, representativeness, etc.
- The course will include a guest lecture.
- Critical understanding: While computers process text more efficiently than humans, they are unable to understand it the way a human reader does. Students will be introduced to the limitations of the text-mining approaches used in class.
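To give a rough idea of what the preprocessing, sentiment and similarity topics above involve, here is a minimal sketch in base R. It is only an illustration under stated assumptions: the two HTML snippets and the word lists are invented for this example and are not course material or a published sentiment dictionary, and the packages actually used in class may differ.

```r
# Hypothetical inputs: two short HTML snippets standing in for firm announcements.
docs <- c(
  a = "<p>Revenue growth was <b>strong</b> and margins improved.</p>",
  b = "<p>The firm reported a loss and weak demand; growth was weak.</p>"
)

# Toy word lists (assumptions for illustration, not a published dictionary).
positive <- c("strong", "improved", "growth")
negative <- c("loss", "weak")

# 1. Preprocessing: strip HTML tags with a regular expression, lower-case,
#    and split each document into alphabetic word tokens.
clean  <- tolower(gsub("<[^>]+>", " ", docs))
tokens <- regmatches(clean, gregexpr("[a-z]+", clean))
names(tokens) <- names(docs)

# 2. Sentiment: (positive count - negative count) / number of tokens.
tone <- vapply(
  tokens,
  function(w) (sum(w %in% positive) - sum(w %in% negative)) / length(w),
  numeric(1)
)

# 3. Similarity: cosine between term-frequency vectors on a shared vocabulary.
vocab  <- sort(unique(unlist(tokens)))
tf     <- sapply(tokens, function(w) as.integer(table(factor(w, levels = vocab))))
cosine <- sum(tf[, "a"] * tf[, "b"]) /
  sqrt(sum(tf[, "a"]^2) * sum(tf[, "b"]^2))

print(tone)    # one tone score per document
print(cosine)  # similarity of the two documents, between 0 and 1
```

A simple word-count score like this ignores negation and context, which is one example of the limitations mentioned under critical understanding.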
After completing the course successfully, the students will have acquired the knowledge, practical skills and competence as specified below:
KNOWLEDGE - The candidate...
- will be familiar with packages in the R software environment that are relevant for text mining
- will have knowledge of important concepts in natural language processing (NLP) and corpus linguistics
- will have a critical and reflective understanding of the text mining approaches presented in the course
SKILLS - The candidate will be able to...
- collect textual data from digital sources relevant for finance and business
- preprocess textual data to make it accessible for the actual analysis
- analyse the retrieved data using natural language processing (NLP)
- visualise results
- use the results to guide economic decision making
COMPETENCE - The candidate...
- has the knowledge necessary to carry out textual data analysis in order to guide economic decisions
- will have an academic basis that prepares them for further studies within the field of business and finance using textual data analysis, such as a master's thesis
This course focuses on practice. All theoretical concepts are applied by implementing the actual code in R.
This course is taught using a combination of regular lectures and examples in R. The lectures are aimed at providing the core information concerning the principles of analysing textual data and their application in finance and business. All subjects discussed will be implemented using R.
The lectures take place on campus (if the covid situation permits). If there are any restrictions, the lectures will be streamed via Zoom.
Although not required, a basic knowledge of R is recommended. Students without any prior knowledge in R are expected to keep up with the course's demands independently.
While not a formal prerequisite, students will benefit from having attended BAN401 or BAN420. There will be synergies with any course using the programming language R.
Credit reduction due to overlap
This course is a continuation of FIE452 and the total number of attempts applies to the course (not the course code).
Requirements for course approval
Course approval consists of two parts:
- Two group based homework assignments with a deadline of one week. The group size will be announced at the beginning of the semester. The groups are formed by the students themselves. Both of the group based homework assignments have to be passed.
- Six short assignments that the students have to work on individually. The expected workload for these short assignments is about 30-45 minutes each. In addition to the two group assignments, 5 of the 6 short, individual assignments have to be passed.
The final grade is based on a project that is handed in as a written report (60%) and an oral presentation of the findings in the report (40%) at the end of the semester. The work on the final project and the presentation is group based. All group members have to take an active part in writing the report and in the oral presentation. While the grade for the written report is the same for all group members, the grade for the oral presentation is given individually.
The students will have one week to complete the final project (report and presentation).
While we expect to hold all oral exams in person, they may be held remotely via Zoom for covid-related reasons.
The final project and the homework assignments have to be submitted in English. Students who want to retake the course have to retake both the final project and the homework assignments. All elements have to be submitted in the same semester.
Laptop with a working installation of R. We will work with R during the course and students are required to bring their laptops to class.
We recommend using a user interface such as RStudio (or others).
Documentation and vignettes for the R packages used
Introductions to R:
- Lam, Longhow, An Introduction to R, available on cran.r-project.org
- Venables, W.N./Smith, D.M., 2016, An Introduction to R, available on cran.r-project.org
- Paradis, E., 2011, R for Beginners, available on cran.r-project.org
- García, D., 2013, Sentiment During Recessions. The Journal of Finance 68(3). 1267-1300.
- García, D./Norli, Ø., 2012, Crawling EDGAR. The Spanish Review of Financial Economics 10. 1-10.
- García, D./Norli, Ø., 2012, Geographic Dispersion and Stock Returns. Journal of Financial Economics 106. 547-565.
- Hoberg, G./Phillips, G., 2016, Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124(5). 1423-1465.
- Loughran, T./McDonald, B., 2016, Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research 54(4). 1187-1230.
- Loughran, T./McDonald, B., 2014, Measuring Readability in Financial Disclosure. The Journal of Finance 69(4). 1643-1671.
- Loughran, T./McDonald, B., 2011, When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance 66(1). 35-65.
- Li, Feng, 2008, Annual Report Readability, Current Earnings, and Earnings Persistence. Journal of Accounting and Economics 45(2). 221-247.
- Tetlock, Paul C., 2007, Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of Finance 62(3). 1139-1168.
- Additional literature might be added to the reading list at the beginning of the course.
- ECTS Credits
- Teaching language
Autumn. Offered autumn 2021.
Assistant Professor Christian Langerfeld, Department of Professional and Intercultural Communication, NHH (main course responsible)
Assistant Professor Maximilian Rohrer, Department of Finance, NHH