BAN432 Applied Textual Data Analysis for Business and Finance

Autumn 2025

Topics
Recent years have seen an increase in textual data that is accessible in electronic form. Examples are annual reports, company releases, and newspaper articles, as well as user-generated content on social media such as blogs, forums and tweets. All of this textual data is generated by humans and may thus contain information about the author's opinions and preferences. Textual data analysis is the process of deriving high-quality information from text, which can subsequently be used for economic decision making. Employing computers to process textual data allows for (1) analysing digital information more quickly than a human reader can, (2) detecting high-dimensional patterns, and (3) conducting structural analyses on textual data.
While this course primarily looks into applications for finance, textual data analysis can be applied to other fields as well.
In this course the following topics are covered:
- Introduction to R: The course is based on the free software environment R. Students will get an introduction to the basic concepts in R that are relevant for text mining. Students will learn to use web sources such as R vignettes and stackoverflow.com effectively.
- Data collection: Students will learn how to write R scripts to collect data from different sources. We will use several Application Programming Interfaces (APIs) and the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR) for firm data. For unstructured data, we will program simple web scrapers.
- Preprocessing and structuring of data: Most of the retrieved data contains unwanted markup, such as HTML tags, or non-words that have to be removed before the analysis. The students will learn how to use regular expressions in R to filter the data and prepare it for analysis.
- Data analysis and application: During the course the students will learn to analyse textual data in various ways, and apply the results to a finance/business context. In particular the course will introduce the students to the following concepts:
  - Sentiment Analysis evaluates the tone (positive/negative) of a document. This can be used to gauge the tone of a given firm announcement, which in turn could feed into an automated trading algorithm.
  - Social Media Analysis is used to extract structural information from data provided by consumers themselves (e.g. tweets). For example, this can be used to measure customers' attitudes towards new products.
  - Document Clustering evaluates the similarity of texts. This can be used to cluster companies in order to study economic linkages such as product competition or supplier relationships.
  - Machine Learning applications for textual data. We will apply algorithms to detect which textual features (keywords etc.) can predict certain outcomes, such as predicting product ratings based on written customer reviews.
  - Other methods from Natural Language Processing (NLP) and Corpus Linguistics: termness, part-of-speech tagging, collocations, representativeness, etc.
- NLP is developing rapidly, so we will discuss current topics such as the language models BERT/FinBERT and ChatGPT.
- The course will include a guest lecture.
- Critical understanding: While computers work more efficiently than humans, they are unable to understand text the way humans do. The students will be introduced to the limitations of the text-mining approaches used in class.
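As a rough illustration of the preprocessing, sentiment and document-similarity topics listed above, the following base-R sketch cleans three toy documents with regular expressions, computes a simple dictionary-based tone score, and compares the documents with cosine similarity. The documents, the two word lists and all function names (`clean_text`, `cosine`) are hypothetical and chosen only for this example; the course itself may use dedicated R packages and established dictionaries instead.

```r
# Illustrative sketch only: toy documents and toy word lists (assumptions).
docs <- c(
  a = "<p>Profit growth was <b>strong</b> and margins improved.</p>",
  b = "<div>Weak demand led to a loss &amp; a decline in sales.</div>",
  c = "<p>Strong sales growth improved margins.</p>"
)

# Preprocessing with regular expressions.
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)    # remove HTML tags such as <p> or </div>
  x <- gsub("&[a-z]+;", " ", x)   # remove HTML entities such as &amp;
  x <- tolower(x)
  x <- gsub("[^a-z]+", " ", x)    # replace runs of non-letters with one space
  trimws(x)
}

tokens <- strsplit(clean_text(docs), " ", fixed = TRUE)
names(tokens) <- names(docs)

# Dictionary-based tone: (positive words - negative words) / total words.
positive <- c("strong", "growth", "improved")   # hypothetical word list
negative <- c("weak", "loss", "decline")        # hypothetical word list
tone <- vapply(tokens, function(tok) {
  (sum(tok %in% positive) - sum(tok %in% negative)) / length(tok)
}, numeric(1))

# Document-term matrix of raw word counts (rows: documents, columns: terms).
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}, integer(length(vocab))))
colnames(dtm) <- vocab

# Cosine similarity between two count vectors.
cosine <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))

sim_ab <- cosine(dtm["a", ], dtm["b", ])
sim_ac <- cosine(dtm["a", ], dtm["c", ])
sim_bc <- cosine(dtm["b", ], dtm["c", ])

print(round(tone, 2))
print(round(c(ab = sim_ab, ac = sim_ac, bc = sim_bc), 2))
```

In this toy setting, documents a and c share most of their vocabulary and therefore obtain the highest similarity, while document b receives a negative tone score. Raw counts and a six-word dictionary are deliberate simplifications; weighting schemes and domain-specific word lists would be natural refinements.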
BAN432 and BAN443 are complementary courses with different focuses. In BAN432 we approach textual analysis bottom up: how to obtain data, how to clean it, and how to use it in different applications. We implement these steps with code that we develop in class. BAN443 focuses on the application of LLMs in a business context.
Learning outcome
After completing the course successfully, the students will have acquired the knowledge, practical skills and competence as specified below:
Knowledge
The candidate:
- will be familiar with packages in the R software environment that are relevant for text mining
- will have knowledge of important concepts in natural language processing (NLP) and corpus linguistics
- will have a critical and reflective understanding of the text mining approaches presented in the course
Skills
The candidate will be able to:
- collect textual data from digital sources relevant for finance and business
- preprocess textual data to make it accessible for the actual analysis
- analyse the retrieved data using natural language processing (NLP)
- visualise results
- use the results to guide economic decision making
General competence
The candidate:
- has the knowledge necessary to carry out textual data analysis in order to guide economic decisions
- will have an academic basis that prepares them for future studies within the field of business and finance using textual data analysis, such as a master's thesis
Teaching

This course is taught using a combination of regular lectures and examples in R. The lectures are aimed at providing the core information concerning the principles of analysing textual data and their application in finance and business. All subjects discussed will be implemented using R.
Recommended prerequisites

Although not required, a basic knowledge of R is recommended. Students without any prior knowledge of R are expected to keep up with the course's demands independently.
Required prerequisites

None
Credit reduction due to overlap

None.
Compulsory Activity
Compulsory activities (work requirements) consist of two parts:
- Two group-based homework assignments in English with a deadline of one week. The group size will be announced at the beginning of the semester. The groups are formed by the students themselves. Both of the group-based homework assignments have to be submitted and approved.
- Six short assignments in English that the students have to work on individually. In addition to the two group assignments, five of the six short individual assignments have to be submitted and approved.
Assessment

The final grade is based on a written project report (60%) and an oral presentation and discussion of the report (40%) at the end of the semester. Both are in English and group-based (3-4 students per group). All group members need to take an active part in writing and presenting the report. While the grade for the written report is the same for all group members, the grade for the oral exam is given individually. The students have one week to complete the final project (report and presentation).
All elements have to be submitted in the same semester. Thus, a retake implies submitting the homework assignments and the final project (report and presentation) again.
Grading Scale

A-F
Computer tools

Laptop with a working installation of R. We will work with R during the course and students are required to bring their laptops to class.
We recommend using a user interface such as RStudio.
Literature
Documentation and vignettes for the R packages used in the course
Introductions to R:
- Lam, Longhow: An Introduction to R, available on cran.r-project.org
- Venables, W.N./Smith, D.M., 2016, An Introduction to R, available on cran.r-project.org
- Paradis, E., 2011, R for Beginners, available on cran.r-project.org
Other literature:
- García, D., 2013, Sentiment During Recessions. The Journal of Finance 68(3). 1267-1300.
- García, D./Norli, Ø., 2012, Crawling EDGAR. The Spanish Review of Financial Economics 10. 1-10.
- García, D./Norli, Ø., 2012, Geographic Dispersion and Stock Returns. Journal of Financial Economics 106. 547-565.
- Hoberg, G./Phillips, G., 2016, Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124(5). 1423-1465.
- Loughran, T./McDonald, B., 2016, Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research 54(4). 1187-1230.
- Loughran, T./McDonald, B., 2014, Measuring Readability in Financial Disclosure. The Journal of Finance 69(4). 1643-1671.
- Loughran, T./McDonald, B., 2011, When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance 66(1). 35-65.
- Li, Feng, 2008, Annual Report Readability, Current Earnings, and Earnings Persistence. Journal of Accounting and Economics 45(2). 221-247.
- Tetlock, Paul C., 2007, Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of Finance 62(3). 1139-1168.
- Additional literature might be added to the reading list at the beginning of the course.