In this tutorial, we will demonstrate how to use text network thematic analysis for qualitative research with the help of InfraNodus software. This approach can be used to analyze open survey responses, interviews, and free-form questionnaire responses.
Unlike other tools, InfraNodus not only provides text-mining insights (based on peer-reviewed research) but also visualizes the relations between ideas and recurrent themes. This makes it easy for researchers to code, categorize, and reveal patterns and gaps in data. Additionally, it uses GPT AI to help researchers interpret their discoveries and generate interesting new ideas based on the discourse they study.
0. What are Qualitative Analysis and Thematic Analysis?
Qualitative analysis is a research methodology used to analyze non-numerical data in order to decode human behavior, emotions, and social constructs. Typical sources of data include interviews, open surveys, observations, documents, literature, and audio or video recordings.
Thematic analysis is a specific method of qualitative data analysis which involves identifying, coding, and interpreting patterns (themes) in the data in order to gain a deeper understanding of people's motivations and beliefs. It is used in many disciplines such as psychology, anthropology, sociology, education, and public health.
1. Getting the Data: Interviews, Open Surveys, Questionnaires, Reviews
The most typical sources of data for qualitative analysis include interviews, open surveys, questionnaires, and customer reviews. Most of the time, we would need to gather this data by conducting the interviews ourselves or sending out online surveys. However, there are some interesting datasets also available online, for example, on Kaggle — a platform for data scientists.
In this tutorial, we will use the data from Kaggle's open survey of data scientists conducted in 2017. This survey had a freeform questionnaire where respondents were asked to identify the main problems they're facing in their personal projects. We will use these freeform responses and perform qualitative research and thematic analysis of these open-ended answers using InfraNodus.
2. Visualizing Text as a Network: Graph Theory Algorithms in Action
2/A. Importing the Data
The first step is to visualize the data in InfraNodus. We will be using a CSV spreadsheet file that contains open survey responses in one column. We will then import this file using the Apps > CSV import template and select the column we want to process:
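Under the hood, this step simply extracts one column of free-form text from the spreadsheet. If you wanted to pre-process the file yourself before importing, the extraction could look like this minimal Python sketch (the column name `PersonalProjectsChallenge` is hypothetical, standing in for the actual Kaggle survey field you would select):

```python
import csv
import io

# A toy CSV with one free-form response column; in practice you would
# open the downloaded Kaggle survey file instead of this inline string.
raw = """respondent,PersonalProjectsChallenge
1,"Finding good public datasets is hard"
2,"Cleaning and pre-processing data takes too much time"
3,"Documentation for the data is often missing"
"""

reader = csv.DictReader(io.StringIO(raw))
responses = [row["PersonalProjectsChallenge"] for row in reader]

for r in responses:
    print(r)
```

Each extracted row then becomes one "statement" in the analysis that follows.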
2/B. Text Network Representation
InfraNodus will then import the text from this column and every row will be represented as a statement. The lemmas of the words and the concepts used in each statement will be represented as nodes on the graph, while their co-occurrences are represented as relations.
This lets us build a text network graph that shows how these concepts relate to one another. InfraNodus will also apply a Force Atlas layout and a community detection (modularity) algorithm that groups the words into thematic clusters, so that words used in the same context are positioned closer together and share the same color. It then applies ranking algorithms to detect the words with the highest betweenness centrality (those that play an important role in connecting the different themes), so the more influential nodes appear bigger on the graph.
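The core of this representation can be sketched in a few lines of plain Python. This is a simplified stand-in, not InfraNodus's actual implementation: each statement becomes a set of word nodes, co-occurrence within a statement becomes an undirected edge, and Brandes' algorithm ranks the nodes by betweenness centrality (layout and modularity clustering are omitted, and the sample statements are illustrative):

```python
from collections import deque
from itertools import combinations

# Toy statements standing in for the survey responses.
statements = [
    "finding clean data",
    "cleaning data takes time",
    "missing documentation for datasets",
    "time for finding datasets",
]

# Build an undirected co-occurrence graph: words that appear in the
# same statement are connected.
adj = {}
for s in statements:
    for a, b in combinations(set(s.split()), 2):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        queue = deque([s])
        while queue:                      # BFS from s
            v = queue.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                      # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc  # undirected graph: each pair of endpoints is counted twice

scores = betweenness(adj)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

The highest-scoring nodes are the ones that bridge different statement clusters, which is exactly why they are drawn bigger on the InfraNodus graph.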
This visual representation already allows us to see the patterns of meaning that emerge in the text, as well as the main topics, concepts, and the relations between them. The analytics panel shows us all the insights derived from this text network representation:
2/C. Cleaning the Graph / Data
As you can see, the concepts "data" and "dataset" are very prevalent, simply because most respondents used these terms in their free-form answers. We can hide them from the graph to recalculate the metrics and surface more nuanced results:
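To illustrate the effect of this "hide and recalculate" step, here is a simplified sketch: we drop the dominant concepts from the co-occurrence counts and re-rank what remains. InfraNodus recalculates the full graph metrics; this example only re-ranks raw co-occurrence frequency, and the statements are illustrative:

```python
from collections import Counter
from itertools import combinations

statements = [
    "finding good data",
    "cleaning data takes time",
    "dataset documentation missing",
    "missing documentation",
]
hidden = {"data", "dataset"}  # the over-represented concepts we hide

# Count co-occurring word pairs, skipping the hidden concepts.
pair_counts = Counter()
for s in statements:
    words = [w for w in s.split() if w not in hidden]
    pair_counts.update(combinations(sorted(set(words)), 2))

print(pair_counts.most_common(3))
```

With "data" and "dataset" removed, previously overshadowed relations (here, "documentation" with "missing") rise to the top of the ranking.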
The final graph looks like this:
3. Interpretation of Data: Finding the Main Themes, Patterns, and Relations
Graph visualization is a very useful way of representing data because it takes advantage of the inherent human capacity to recognize patterns. A researcher can also disassemble the text into small parts, inspect the relations, and have a tangible understanding of how the meaning is constructed within.
Using the Graph to Reveal the Main Ideas
The first step in interpreting the data is to look at the graph and see what emerges from it. For example, we can directly see that the following terms are quite influential:
Given that our original dataset consists of data scientists reporting the problems they experience, we can quickly see that they struggle to find and clean data, and that they suffer from a lack of datasets and time.
We did this in just a few seconds by looking at the graph.
Revealing the Main Themes
These same insights are available in the Analytics panel > Most Influential Concepts.
We can also reveal the main themes in the Analytics Panel > Topical Clusters (also shown with the same color on the graph):
• lack, documentation, missing
• project, learning, specific
• finding, set, question
• understanding, feature, processing
Two of the four clusters give us a more precise idea about the problems that data scientists are facing:
• lack of or missing documentation for datasets
• finding the right sets or questions
Revealing the Important Terms in Context
The remaining two can be clarified if we click on these terms (e.g. "project", "learning") and see in which context they are used:
We now better understand the other two main themes:
• unsuitability of data for machine-learning projects
• limited number of features for processing
Thanks to this first iteration of analysis, we now have a clear understanding of the main problems data scientists face in their personal projects:
• lack of documentation
• finding data sets
• unsuitability of data for machine learning
• time constraints
Thematic Analysis with GPT-4 and ChatGPT
The approach we demonstrated above is based solely on text mining and visual graph analysis. However, you could also obtain all these interpretations using the built-in GPT-3 and GPT-4 AI tools.
For example, we can use the built-in GPT-3 AI to give names to the topical clusters (themes) we identified above to reveal high-level ideas in these responses. We can also use GPT to generate a quick summary for the main topics identified, which is quite a precise explanation of the main concerns that our respondents have:
We can see that the main problems are related to data quality and pre-processing concerns, as well as problems with machine learning implementation.
The AI-generated summary also points to these problems, highlighting the need to clean and pre-process the data, which is time-consuming and, for many data scientists, far from their favorite part of the job.
Sentiment Analysis of Survey Responses
The Analytics panel also has the Sentiment Analysis tab, which can filter responses into Negative / Positive / Neutral categories. This is useful if we want to compare the feedback from people who say positive things with the feedback from those who say negative things, so that we better understand the nature of each sentiment and can adjust our response accordingly.
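The general idea behind such a sentiment split can be sketched with a toy lexicon-based classifier. InfraNodus uses its own sentiment model; the word lists and responses below are purely illustrative:

```python
# Hypothetical mini-lexicons; a real model would be far richer.
POSITIVE = {"good", "great", "easy", "love", "useful"}
NEGATIVE = {"hard", "missing", "problem", "difficult", "poor"}

def sentiment(text):
    """Classify a response by counting positive vs. negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

responses = [
    "machine learning is great for my project",
    "documentation is missing and data quality is poor",
    "I analyze survey data",
]
buckets = {"positive": [], "negative": [], "neutral": []}
for r in responses:
    buckets[sentiment(r)].append(r)

print({k: len(v) for k, v in buckets.items()})
```

Once the responses are split into buckets, each bucket can be analyzed as its own text network to see which concepts drive each sentiment.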
For instance, in the example above, when we filter the positive and negative responses, the responses concerning dataset quality and pre-processing carry a much more negative connotation. The topic of "machine learning" appears in positive comments, which means that even though people talk about problems, the machine learning part does not affect them as negatively as the pre-processing and data quality issues:
Relational Analysis of Concepts
Another interesting feature is the ability to analyze concepts and their relations. If we go to the Analytics > Relations tab, we can see the most prominent relations between the concepts (bigrams that appear in 4-gram windows). These show the most typical combinations of concepts that can be encountered throughout the text.
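The phrase "bigrams that appear in 4-gram windows" can be made concrete with a short sketch: we count every pair of words that co-occur inside a sliding window of four tokens. This is an approximation of how relation strength between concepts can be measured, not InfraNodus's exact procedure:

```python
from collections import Counter

def windowed_pairs(text, window=4):
    """Count word pairs co-occurring within a sliding token window."""
    words = text.lower().split()
    pairs = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            pairs[tuple(sorted((words[i], words[j])))] += 1
    return pairs

# Illustrative sentence echoing the survey vocabulary.
counts = windowed_pairs("finding a big challenge in cleaning a big challenge dataset")
print(counts.most_common(3))
```

Pairs that keep reappearing close together across many responses (such as "big" and "challenge" here) surface as the most prominent relations.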
If we select some of them on the graph, we can see which other concepts they relate to, so we can better understand the context.
For instance, if we select "big challenge", which is the most prominent phrase used in the responses, we can see from the graph that it relates to "cleaning", "format", "finding", and "set", pointing to the pre-processing problem we identified earlier, as well as the difficulty of finding the right data.
The measure of influence in the Analytics > Relations panel on the right gives us a relative measure of importance for a particular combination of words we're analyzing.
For instance, the words "big" and "challenge" have a common influence of 0.08 (or 4% of the total). We can select the words "lack" and "documentation" and while it's a less frequent combination, its total influence is 0.2324 (11%) — almost 3 times higher. This indicates that this particular problem plays a more important role in the total discourse.
Numerical values for words and relations enable us to attach a quantitative measure to the qualitative insights we get from the graphs.
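The share calculation behind these percentages is straightforward: sum the influence scores of the selected words and divide by the total influence across all words. The scores below are hypothetical placeholders, not the actual InfraNodus values for this dataset:

```python
# Hypothetical per-word influence scores (e.g. normalized betweenness).
influence = {
    "big": 0.05, "challenge": 0.03,
    "lack": 0.12, "documentation": 0.11,
    "data": 0.30, "cleaning": 0.09,
}

def combined_share(a, b, scores):
    """Relative influence of a word pair as a fraction of total influence."""
    total = sum(scores.values())
    return (scores[a] + scores[b]) / total

print(round(combined_share("lack", "documentation", influence), 3))
print(round(combined_share("big", "challenge", influence), 3))
```

Comparing these shares across word pairs shows which combination carries more weight in the overall discourse, even when one pair occurs less frequently than another.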
This approach lets us get more nuance about a specific theme that we identified and dig deeper to better understand the detail of responses.
What do People Say about X? AI-Generated Summaries for Specific Topics
We can also use the built-in GPT AI tool to generate a summary of the statements we selected. In the example above, we selected "big challenge" and InfraNodus filtered the 26 statements that contain these concepts.
We can then click on AI: Summarize Visible Statements to generate a summary for the visible statements, producing the following result:
As you can see, it provides a pretty good summary of the main "big challenges" people face: data processing, structure of data, finding storage. We also find out that there's a challenge of fake data and parsing data.
Actionable insight: It is very clear that poor data quality and pre-processing are some of the biggest challenges that data scientists face. If we were to develop a policy or a product solution, we would want to focus on these aspects because they would enable us to bring the most added value to our respondents. For instance, we could develop a platform for sharing high-quality public data sets with a built-in rating system. If we were a public body, we could put most of our public data online, paying particular attention to the quality of the datasets, so that researchers can create various publications and studies based on this data.
4. Structural Gap Analysis: Finding What's Missing
A very powerful feature of text network graphs is that they enable us to see the structure of the discourse and reveal the gaps between the different themes identified.
Using this approach, we can see what's missing and come up with possible solutions that span the various themes identified.
Connecting the Different Topics
In order to do this, we can look at the graph and identify some of the topics that are distinct from one another (preferably at opposite sides of the graph, as this means they are semantically the furthest apart). We can also use the Analytics > Gap Insight feature, which will identify the gap for us and use GPT-3 AI to generate a research question that bridges this gap with an interesting idea.
To reveal the gap, click "Show Highlight". To reload the gap, click "Show Another Gap". It's recommended to iterate a few times until you find a combination of two themes that are the furthest away from each other. For instance:
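The intuition behind gap detection can be sketched in a simplified way: given the thematic clusters and the edges between words, find the pair of clusters with the fewest direct connections. The cluster assignments and edges below are hypothetical, standing in for the modularity clusters InfraNodus computes:

```python
from itertools import combinations

# Hypothetical word-to-cluster assignments.
cluster = {
    "machine": "A", "learning": "A", "model": "A",
    "quality": "B", "cleaning": "B", "missing": "B",
    "time": "C", "finding": "C",
}
# Hypothetical co-occurrence edges between words.
edges = [
    ("machine", "learning"), ("learning", "model"),
    ("quality", "cleaning"), ("cleaning", "missing"),
    ("time", "finding"), ("finding", "quality"),
    ("model", "time"),
]

# Count edges crossing each pair of clusters.
between = {pair: 0 for pair in combinations(sorted(set(cluster.values())), 2)}
for u, v in edges:
    cu, cv = cluster[u], cluster[v]
    if cu != cv:
        between[tuple(sorted((cu, cv)))] += 1

gap = min(between, key=between.get)
print(gap, between[gap])
```

The least-connected cluster pair is the structural gap: bridging it with a new idea (here, linking the "machine learning" theme to the "data quality" theme) is exactly what the Gap Insight feature proposes.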
In this case, we revealed the gap between the two themes: machine learning and poor data quality.
We can interpret this ourselves by linking these two topics together, for instance, by saying that poor data quality hinders our capacity to implement machine learning algorithms. We can also use the built-in GPT-4 AI (which also powers ChatGPT) to generate this for us automatically. In this case, it points at a problem where a lack of documentation makes it difficult to implement machine learning algorithms.
Finding Discourse Entry Points
Another really powerful feature of InfraNodus is its ability to identify discourse connector / entry points. These are concepts that have high influence relative to the number of connections they have. Usually, these are ideas located at the periphery of the discourse that are nevertheless well connected to the central ideas. They are not overwhelmed by their connections, yet they are well integrated. A good analogy is a person who knows only a few people, but the people they know are the most important ones.
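This notion of "high influence per connection" can be sketched as a simple ratio: each node's influence score divided by its degree, favoring well-placed but lightly connected concepts. The scores and degrees below are hypothetical, for illustration only:

```python
# Hypothetical influence scores (e.g. normalized betweenness) and degrees.
influence = {"data": 0.30, "documentation": 0.12, "public": 0.10, "cleaning": 0.08}
degree = {"data": 25, "documentation": 4, "public": 3, "cleaning": 10}

# Rank by influence per connection: peripheral but well-placed words win.
ratio = {w: influence[w] / degree[w] for w in influence}
entry_points = sorted(ratio, key=ratio.get, reverse=True)
print(entry_points)
```

Note how a hub like "data" ranks low despite its high influence, because that influence is spread over many connections, while a peripheral word like "public" ranks high.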
For the discourse we're analyzing, these words are: "incomplete, documentation, public, set, big" — they are available in the Analytics Panel > Gap Insight > Discourse Connector Points:
We can then use the GPT-4 AI to generate an idea that links those concepts together. In this case, it proposes a platform that could facilitate collaborative work on documenting public datasets, which is a really good idea in relation to the discourse we study.
Actionable insight: If we were developing a policy or a solution to this particular problem, we would emphasize good data quality, especially in the context of machine learning applications, as this would bridge the gap and address multiple important problems at once. One solution could be to set up a platform where users could annotate and document public datasets and exchange datasets with each other.