Completeness of datasets for large text uploads.

May 27, 2020 08:02

Hi,

I hope you can support me by providing information on 2 related topics.

1) Can you direct me to more information about the Stats feature of the analysis, specifically the definitions and uses of degree, frequency, betweenness, topic, conductivity, locality & diversivity?

2) I am using the tool to analyse discourses in a number of domains, which seems to require converting rich documents into TXT format to both reduce file size and eliminate graphics. It also involves combining related documents (e.g. 3 different sources of testimony) into a single network. When I do this, words/nodes that I know to be present in the dataset and in the individual testimony graphs do not appear in the consolidated graph or stats, and there is no way to determine what the full uploaded data set is.

I am a subscriber, but the app does not seem to accept text files above 250 KB despite the 3 MB notification, so I have been splitting files and uploading them in sequence. Is this approach likely to impair the completeness or integrity of the network analysis, e.g. by dropping nodes from previous revisions at each cycle?

Thank you for your support. I find it a fascinating and useful tool but I do need to know it is working on the full data set.

Best wishes

John

Comments

1 comment

Official comment
Dmitry Paranyushkin

June 22, 2020 15:10
Hi John!

Regarding your first question: if you open the Analytics > Stats panel and then click the Help menu on the right, it will walk you through all these parameters in detail. Here is an explanation of each:

DEGREE: how many connections the node has. Correlates with frequency.
FREQUENCY: how often the word appears in the text.
BETWEENNESS: normalized measure of betweenness centrality, i.e. how often the node appears on the shortest path between any two randomly chosen nodes in the graph. Measures global influence.
TOPIC: which topical cluster the node belongs to. Useful for classification.
CONDUCTIVITY: Betweenness divided by Degree. Shows which nodes can reach the most nodes in different parts of the network fastest, with the fewest connections.
LOCALITY: Degree squared divided by Betweenness. Shows local influencers with the fewest global connections.
DIVERSIVITY: Betweenness divided by Frequency. Shows the highest global connectivity with the fewest mentions, which indicates the turning points that produce shifts in the narrative plot.
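
To make the three ratio-based measures concrete, here is a minimal Python sketch that derives them from a node's degree, frequency and betweenness using the formulas listed above. The function name, the sample values and the choice to return None when a denominator is zero are assumptions made for illustration. This is not the tool's own code.

```python
def derived_metrics(degree, frequency, betweenness):
    """Return conductivity, locality and diversivity for one node.

    Ratios with a zero denominator are reported as None. That choice is
    an assumption of this sketch, not documented behaviour of the tool.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "conductivity": ratio(betweenness, degree),    # betweenness / degree
        "locality": ratio(degree ** 2, betweenness),   # degree^2 / betweenness
        "diversivity": ratio(betweenness, frequency),  # betweenness / frequency
    }


# Hypothetical node: 8 connections, 5 mentions, normalized betweenness 0.4
example = derived_metrics(degree=8, frequency=5, betweenness=0.4)
print(example)
```

A node with high betweenness but few mentions gets a high diversivity value, which matches the "turning point" reading given above.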

Regarding the big files, you are right to convert them to TXT to reduce the size. By default, the software shows only the top 150 nodes. If you need more, simply go to Settings and choose to show 500 (or 1000) nodes, so that the nodes beyond the top 150 are displayed in the graph as well.

You can also split your text files into several parts and then upload them one by one into the graph with the _same_ name. This way you have all the data you need in one place.
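
If you prefer to split the files with a script, here is a minimal Python sketch that breaks a text into parts at line boundaries. The function name is hypothetical, and the 250,000-byte default is only an assumption taken from the limit mentioned in the question, so adjust it to whatever the upload actually accepts.

```python
def split_text(text, max_bytes=250_000):
    """Split text into parts of at most max_bytes (UTF-8), breaking at line ends.

    A single line longer than max_bytes is kept whole as its own part
    rather than being cut mid-sentence.
    """
    parts, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_size = len(line.encode("utf-8"))
        # Start a new part when adding this line would exceed the limit.
        if current and size + line_size > max_bytes:
            parts.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        parts.append("".join(current))
    return parts


# Tiny demonstration with an artificially small limit
sample = "first statement\nsecond statement\nthird statement\n"
parts = split_text(sample, max_bytes=35)
print(len(parts), "parts")
```

Joining the parts back together reproduces the original text, so nothing is lost in the split itself. Each part can then be uploaded in sequence under the same graph name.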

Keep in mind, however, that the bigger the graph gets, the harder it is to read (you'll still get meaningful topical cluster insights, though). That is why we limit the number of nodes shown.
