Completeness of datasets for large text uploads.


1 comment

  • Official comment
    Dmitry Paranyushkin

    Hi John!

    Regarding your first question: if you open the Analytics > Stats panel and then click the Help menu on the right, it will walk you through all these parameters in detail. Here is an explanation of each:

    DEGREE — shows how many connections the node has: correlates with the frequency.
    FREQUENCY — how often the word appears in the text.
    BETWEENNESS — normalized measure of betweenness centrality: how often the node appears on the shortest path between any two randomly chosen nodes in the graph. Measures global influence.
    TOPIC — which topical cluster the node belongs to; useful for classification.
    CONDUCTIVITY — Betweenness divided by Degree: which nodes can reach the most nodes in the different parts of the network faster with the least connections.
    LOCALITY — Degree squared divided by Betweenness: Local influencers with the least global connections.
    DIVERSIVITY — Betweenness divided by Frequency: highest global connectivity with the least mentions - indicates the turning points that produce the narrative plot shifts.
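
    The three derived measures above are plain ratios of Degree, Frequency and Betweenness, so they can be sketched in a few lines of Python. This is only an illustration of the formulas as described here, not the software's actual code: the NodeStats class, the derived_metrics function and the handling of zero values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Per-node values as listed in the Stats panel (hypothetical container)."""
    degree: int          # number of connections
    frequency: int       # how often the word appears in the text
    betweenness: float   # normalized betweenness centrality


def derived_metrics(stats: NodeStats) -> dict[str, float]:
    """Compute the ratio-based measures from the three base values.

    Zero denominators are an edge case the definitions do not cover;
    returning 0.0 or infinity here is an assumption of this sketch.
    """
    # CONDUCTIVITY: Betweenness divided by Degree
    conductivity = stats.betweenness / stats.degree if stats.degree else 0.0
    # LOCALITY: Degree squared divided by Betweenness
    locality = (
        stats.degree ** 2 / stats.betweenness
        if stats.betweenness
        else float("inf")
    )
    # DIVERSIVITY: Betweenness divided by Frequency
    diversivity = stats.betweenness / stats.frequency if stats.frequency else 0.0
    return {
        "conductivity": conductivity,
        "locality": locality,
        "diversivity": diversivity,
    }
```

    For example, a node with Degree 4, Frequency 8 and Betweenness 0.5 gets Conductivity 0.125, Locality 32 and Diversivity 0.0625.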


    Regarding the big files, you are right to convert them to TXT to reduce the size. By default, the software shows only the top 150 nodes. If you need more, simply go to the Settings and choose to show 500 nodes (or 1000). Then you will have all the nodes in the graph.


    You can also split your text files into several parts and then upload them one by one into the graph with the _same_ name. This way you have all the data you need in one place.
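
    If you want to prepare the parts with a script, here is a minimal sketch in Python. It assumes the text is already loaded as a string and that you pick the size limit yourself (no specific limit is stated here); split_text and max_chars are names made up for this example. It breaks at blank lines between paragraphs where it can, and only cuts inside a paragraph when that paragraph alone is longer than the limit.

```python
def split_text(text: str, max_chars: int) -> list[str]:
    """Split text into parts of at most max_chars characters.

    Parts are broken at paragraph boundaries (blank lines) when possible.
    A single paragraph longer than max_chars is cut at the limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    parts: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Try to append this paragraph to the part being built.
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: close the current part first.
        if current:
            parts.append(current)
        # Hard-split a paragraph that is too long on its own.
        while len(paragraph) > max_chars:
            parts.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        parts.append(current)
    return parts
```

    Each item in the returned list can then be saved as its own TXT file and uploaded one by one into the graph with the same name.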

    Keep in mind, however, that the bigger the graph gets, the harder it is to read (you'll still get meaningful topical cluster insights, though). That's why we set a limit on the number of nodes shown.
