This is an excerpt from Applied Sport Business Analytics With HKPropel Access by Christopher Atwater, Robert E. Baker & Ted Kwartler.
Basic Text Processing Workflow
Now that you have a basic understanding of string manipulations, it is time to explore a basic framework for tackling a natural language processing (NLP) analysis. Because the diversity of language makes NLP difficult, the first step is to clearly define the aim of the project. The aim may be high level, such as to explore and define methods for gauging social media fan engagement, or it may be more precise, such as to analyze scouting reports alongside other athletic data to model on-field performance. Without a problem definition, you will be doing “curiosity analysis” with no direction. Given the challenges of natural language analytics, strive to keep the problem definition as succinct as possible, and be willing to iterate and adjust along the way.
Once you have a problem reasonably defined, it should lead you toward a channel and specific pieces of text for analysis. You may use online reviews, contracts, or something else, but the source is rarely the entire Internet or some vast collection of unrelated and diverse documents. Next, you need to preprocess the documents, which entails organization and feature extraction. A simple example would be collecting 10,000 tweets mentioning a player. Once the documents are organized into a corpus, or collection of related documents, you can extract features such as sentiment scores from them. The features or values extracted vary depending on the type of analysis you expect to perform. Extraction could be as simple as counting the occurrences of a term or as complex as creating a modeling matrix for use in a deep neural net model that classifies documents.
In any case, once the appropriate features have been extracted from the documents, you run the analysis and finally seek to address the problem definition. Once again, addressing the problem definition may be as simple as providing a visual like a word cloud or as complex as using the output of the analysis in a customer propensity machine learning model.
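The simplest feature mentioned above, counting the occurrences of a term, can be sketched in a few lines of Python. This is a minimal illustration rather than the book's own method: the three-document corpus is invented for the example, and the helper name `count_term` is hypothetical.

```python
import re

# Hypothetical mini-corpus: invented text standing in for collected tweets.
corpus = [
    "Great game tonight, the defense was great!",
    "Tough loss. The defense fell apart late.",
    "Trade rumors again...",
]

def count_term(document: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of `term` in `document`."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return tokens.count(term.lower())

# One feature value per document for each term of interest.
defense_counts = [count_term(doc, "defense") for doc in corpus]
great_counts = [count_term(doc, "great") for doc in corpus]

print(defense_counts)
print(great_counts)
```

Each list holds one value per document, so it can serve directly as a feature column in a later analysis step.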
In review, the basic steps of an NLP project are outlined below.
- Problem definition
- Identifying text sources
- Preprocessing and feature extraction
- Analytics
- Insight and recommendations
In this chapter, the example problem (step 1 above) is to gauge fan engagement on social media for various teams and marquee players in the National Basketball Association (NBA) using several common methods. These methods are not an exhaustive set of approaches, but they are useful, satisfying to explore, and applicable to other types of documents.
Thus, in the NBA fan engagement example, our remaining steps are as follows:
- Identifying text sources: We will focus on a collection of tweets amassed daily throughout the 2019-2020 NBA season.
- Preprocessing and feature extraction: Conduct string manipulation and organize the tweets into a document-term matrix to obtain term frequencies.
- Analytics: Build visualizations such as bar charts, word clouds, and pyramid plots. Perform word associations and sentiment analysis.
- Insight and recommendations: Within the provided text, identify the most discussed teams, terms, and corresponding sentiment representing the Twitter dialogue of fans and sports professionals.
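The preprocessing step above, organizing text into a document-term matrix to get term frequency, can be sketched as follows. This is a rough illustration under stated assumptions, not the chapter's implementation: the three tweets are invented, the stop-word list is a tiny hypothetical one, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical tweets, invented for illustration only.
tweets = [
    "What a dunk! Best dunk of the season",
    "Season opener tonight",
    "That dunk ended the game",
]

# Tiny, hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "of", "the", "that", "what"}

def tokenize(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

# Count the terms in each document.
doc_counts = [Counter(tokenize(t)) for t in tweets]

# The vocabulary defines the matrix columns, in alphabetical order.
vocabulary = sorted(set().union(*doc_counts))

# Document-term matrix: one row per tweet, one column per term.
dtm = [[counts[term] for term in vocabulary] for counts in doc_counts]

# Term frequency across the corpus is the sum of each column.
term_frequency = {term: sum(row[j] for row in dtm) for j, term in enumerate(vocabulary)}

# Most frequent terms first; ties are broken alphabetically.
top_terms = sorted(term_frequency.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top_terms)
```

The column sums in `term_frequency` are the kind of values a bar chart or word cloud would plot in the analytics step.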