FIE452 Applied Textual Data Analysis for Business and Finance

  • Topics


    Recent years have seen an increase in textual data that is accessible in electronic form. Examples are annual reports, company releases, and newspaper articles, as well as user-generated content on social media such as blogs, forums and tweets. All of these texts are written by humans and may thus contain information about the author's opinions and preferences. Textual data analysis is the process of deriving high-quality information from text, which can subsequently be used for economic decision making. Employing computers to process textual data allows for (1) analysing digital information more quickly than a human reader can, (2) detecting high-dimensional patterns, and (3) conducting structural analyses of textual data.

    Applications in business and finance include:

    • Automated trading algorithms that trade on contemporaneous news (Finance)
    • Social media analysis and targeted customer analysis (Marketing)
    • Automatic fraud detection and forensic accounting (Accounting)


    In this course the following topics are covered:

    • Introduction to R: The course is based on the free software environment R. Students will get an introduction to the basic concepts in R that are relevant for text mining. Students will learn to use web sources such as R vignettes effectively.
    • Data collection: Students will learn how to write R scripts to collect data from different sources. In particular, this course discusses Application Programming Interfaces (APIs), for example those provided by Google, Twitter, or The New York Times. The course introduces the students to techniques for retrieving company filings, such as annual reports and merger prospectuses, through the Electronic Data Gathering, Analysis, and Retrieval system (EDGAR).
    • Preprocessing and structuring of data: Most of the retrieved data contains unwanted markup, such as HTML tags, or non-words that have to be removed before compiling a corpus (a collection of texts). The students will learn how to use regular expressions in R to filter the data and prepare it for analysis.
    • Data analysis and application: During the course the students will learn to analyse real text data in various ways and to apply the results in a business context. In particular, the course will introduce the students to the following concepts:
      • Sentiment Analysis evaluates the tone (positive/negative) of a document. This can be used to judge the tone of a given firm announcement, which in turn could be incorporated into an automated trading algorithm.
      • Social Media Analysis is used to extract structural information from data provided by consumers themselves (e.g. tweets). For example, this can be used to measure customers' attitudes towards new products.
      • Document Clustering evaluates the similarity of texts. This can be used to cluster companies in order to study economic linkages such as product competition or supplier relationships.
      • Natural Language Processing (NLP) and Corpus Linguistics: termness, part-of-speech tagging, collocations, representativeness, etc.
    • Case Study: The course will include a guest lecture where the students will be introduced to relevant corpora. The lecturer will discuss the practical challenges of constructing a large-scale corpus and making it accessible. Central concepts in computational/corpus linguistics and phraseology, including n-grams, keyness, collocations, statistical measures of association, dispersion, etc. will be introduced. This case study will explore the characteristics of business news discourse contrasted with other types of discourse.
    • Critical understanding: While computers work more efficiently than humans, they are unable to understand text the way a human reader does. The students will (among other things) be introduced to the following limitations:
      • Context. In everyday language the words "liability" and "debt" have a negative undertone, while in a business context they merely refer to a certain form of capital.
      • Reading between the lines. A computer is unable to detect sarcasm, irony, metaphors, etc.
      • Classification of words. "Up" is often classified as positive, which is misleading when the word refers to unemployment.
      • Negation. The sentiment of a given word is inverted by the word "not", which might be difficult for a computer to detect.
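
    The preprocessing step and the negation and word-classification limitations listed above can be made concrete with a small example. The course itself works in R; the following is only a minimal, language-neutral sketch written in Python. The word lists, the function names, and the one-word negation rule are hypothetical choices made for this illustration and are not taken from the course material.

```python
import re

# Hypothetical toy word lists, chosen only for this illustration.
POSITIVE = {"good", "strong", "profit", "up"}
NEGATIVE = {"bad", "weak", "loss", "liability", "debt"}
NEGATORS = {"not", "no", "never"}


def strip_markup(text):
    """Remove anything that looks like an HTML tag and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def naive_score(tokens):
    """Count positive minus negative words, ignoring all context."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def negation_aware_score(tokens):
    """Same count, but flip a word's sign if a negator directly precedes it."""
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


raw = "<p>Results were <b>not good</b>. Unemployment is up.</p>"
tokens = tokenize(strip_markup(raw))
print(tokens)
print("naive:", naive_score(tokens))
print("negation-aware:", negation_aware_score(tokens))
```

    In this example the naive count treats both "good" and "up" as positive, whereas the negation-aware count flips "not good" to negative. Neither version recognises that "up" is bad news when it describes unemployment, and "liability" would be scored as negative even in a neutral accounting context, which is the kind of limitation the list above describes.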

  • Learning outcome

    After completing the course successfully, the students will have acquired the knowledge, practical skills and competence as specified below:

    KNOWLEDGE - The candidate...

    • will be familiar with packages in the R software environment that are relevant for text mining
    • will have knowledge of important concepts in natural language processing (NLP) and corpus linguistics
    • will have a critical and reflective understanding of the text mining approaches presented in the course
    • will have an academic basis that prepares them for future studies within the field of business and finance using textual data analysis

    SKILLS - The candidate will be able to...

    • collect textual data from digital sources relevant for business and finance
    • preprocess textual data to make it accessible for the actual analysis
    • analyse the retrieved data using natural language processing (NLP)
    • visualise results
    • use the results to guide economic decision making


    COMPETENCE - The candidate...

    • has the knowledge necessary to carry out textual data analysis in order to guide economic decisions

    This course focuses on practice. All theoretical concepts are applied by implementing the actual code in R.

  • Teaching


    This course is taught using a combination of regular lectures and examples in R. The lectures are aimed at providing the core information concerning the principles of analysing textual data and their application in finance and business. All subjects discussed will be implemented using R. Students have to bring their laptops to class and work with the examples individually or in groups.

  • Restricted access

    Enrollment in this course is limited to 50 students due to the pedagogical methods used. The deadline for registration for limited enrollment courses is earlier than for unlimited courses. For more information, please see:

  • Recommended prerequisites

    Although not required, a basic knowledge of R is recommended. Students without any prior knowledge of R are expected to keep up with the course's demands independently.

    While not a formal prerequisite, students will benefit from having attended BUS455 Applied programming and data analysis for business. The focus of this course is different from FIE446 Financial Engineering, but there will be synergies due to the use of the same programming language.

  • Assessment


    The total grade is based on a final project (70%) and three homework assignments (10% each). The final project and the homework assignments are group-based.

    Final project: The students will have one week to complete the final project. During the final project, the students have to solve a set practical problem in R. The solution will consist of a plain text file containing the commented code and a report that discusses the results.

    Homework assignments: For each homework assignment, the submission will consist of a plain text file containing the commented code. All homework assignments have to be delivered within the deadlines, and the grade must be at least E.

    Grading: The grading will be based on the correctness of the results, the clarity and persuasiveness of each assignment, and the students' ability to learn and implement new concepts. Students are expected to relate their final project to the relevant literature. The final project and the homework assignments have to be submitted in English. Students who want to retake the course have to retake both the final project and the homework assignments. All elements have to be submitted in the same semester.

  • Grading Scale


  • Computer tools

    Laptop with a working installation of R. We will work with R during the course and students are required to bring their laptops to class.

    We recommend using a user interface such as RStudio (or a comparable alternative).

  • Semester



  • Literature


    Documentation and vignettes for the R packages used

    Introductions to R:

    • Lam, Longhow: An Introduction to R., available on
    • Venables, W.N./Smith, D.M., 2016, An Introduction to R., available on
    • Paradis, E., 2011, R for Beginners, available on


    Other literature:

    • García, D., 2013, Sentiment During Recessions. The Journal of Finance 68(3). 1267-1300.
    • García, D./Norli, Ø., 2012, Crawling EDGAR. The Spanish Review of Financial Economics 10. 1-10.
    • García, D./Norli, Ø., 2012, Geographic Dispersion and Stock Returns. Journal of Financial Economics 106. 547-565.
    • Hoberg, G./Phillips, G., 2016, Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124(5). 1423-1465.
    • Loughran, T./McDonald, B., 2016, Textual Analysis in Accounting and Finance: A Survey. Journal of Accounting Research 54(4). 1187-1230.
    • Loughran, T./McDonald, B., 2011, When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance 66(1). 35-65.
    • Tetlock, Paul C., 2007, Giving Content to Investor Sentiment: The Role of Media in the Stock Market. The Journal of Finance 62(3). 1139-1168.
    • Hoberg, G./Lewis, C., 2017, Do fraudulent firms produce abnormal disclosure? Journal of Corporate Finance 43. 58-85.


ECTS Credits
Teaching language

Course responsible

Christian Langerfeld, Department of Professional and Intercultural Communication

Maximilian Rohrer, BI Norwegian Business School, Department of Finance