In an ideal world, it would be possible to export any data we find on the web in an easily comprehensible format for further processing.
However, when that is not possible, you have to get "hacky" and extract the data yourself. Mind you, this may be against a website's terms of use, so we do not encourage the approach demonstrated here. Rather, it is a technique that you may find useful when a simple copy and paste doesn't work and when it is legal to use.
Your best friend in this case is the JavaScript console, which you can open via View > Developer > JavaScript Console in the Google Chrome browser. Once the console is open, you can run JavaScript code to extract the content of any element on the page as plain text.
Specifically, you need to find the HTML class name of the elements you want to extract and then run JavaScript code in the browser to collect the text content of all elements of that class.
Below, we explain how to do that using the Chrome web console in combination with ChatGPT or jQuery, with LinkedIn posts and Reverso Context as examples.
Manual Scraping from Websites Using Your Browser
Suppose you would like to import data from Reverso Context into InfraNodus to analyze the semantic field around a certain concept or phrase.
Reverso Context is a resource that provides examples of how certain words or phrases are used in specific contexts. Reverso has no API, and copying and pasting the data doesn't work well because you get the phrases in both languages, which is not very useful for analysis.
What to do instead?
Stage 1: Finding the Class Name of the Elements to Scrape
In order to extract the data from any web page, you need to identify the class name of the elements you want to extract in the page's HTML code.
This works best in the Google Chrome browser and should work on most websites. Don't worry about the code below; it's not that hard.
Step 1: Open the Google Chrome JavaScript console (View > Developer > JavaScript Console)
Step 2: Point to the element that you're interested in
Step 3: Right-click (or Ctrl-click) it and choose "Inspect" to see which DIV element you're looking at
Step 4: Find its data in the Inspector window that opens next to the console. You are looking for the class name of this particular item and of its closest "parent" (ancestor element).
In this case, we are interested in the English phrase examples, which we want to import into InfraNodus to visualize the typical expressions as a graph. So we look for the class structure in which they appear: the `example`, `trg ltr` and `text` classes.
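The class names found in the Inspector can be combined into a single CSS selector. The following is a minimal sketch, assuming the nesting described above (an `example` element that contains a `trg ltr` element whose direct child has the class `text`). The helper name `classesToSelector` is hypothetical, and the actual markup of the page may differ.

```javascript
// Hypothetical helper: turn a class attribute value into a CSS class selector.
// A value with spaces such as "trg ltr" lists two classes, so it becomes ".trg.ltr".
function classesToSelector(classAttribute) {
  return classAttribute
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map(function (name) { return '.' + name; })
    .join('');
}

// Assumed nesting: an .example element, a .trg.ltr descendant, a .text direct child.
var selector = classesToSelector('example') + ' ' +
  classesToSelector('trg ltr') + ' > ' +
  classesToSelector('text');

console.log(selector); // ".example .trg.ltr > .text"

// In the browser console, the selector can then be passed to querySelectorAll.
if (typeof document !== 'undefined') {
  console.log(document.querySelectorAll(selector).length + ' matching elements');
}
```

If the count printed in the console is 0, the assumed nesting probably does not match the page and the class names should be checked again in the Inspector.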
Stage 2.a: Using Custom JavaScript Code to Extract the Web Page Data
Step 5: You can use custom JavaScript code to extract the data from all the elements of a certain HTML class. In the previous section, you identified the class name of the elements you need, so just replace `.your_class_name_here` below with that name (keep the leading dot):
var elements = document.querySelectorAll('.your_class_name_here');
var texts = Array.from(elements).map(function(element) {
  return element.textContent;
});
console.log(texts);
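The array produced by this snippet often contains extra whitespace and empty entries. Below is a minimal sketch of a clean-up step, assuming `texts` is the array from the snippet above. The function name `cleanTexts` and the sample strings are hypothetical and only stand in for real page content.

```javascript
// Hypothetical helper: normalize whitespace and drop empty entries.
function cleanTexts(texts) {
  return texts
    .map(function (text) {
      // Collapse runs of spaces, tabs and line breaks into a single space.
      return String(text).replace(/\s+/g, ' ').trim();
    })
    .filter(function (text) {
      return text.length > 0;
    });
}

// Sample input standing in for the `texts` array from the console.
var sampleTexts = ['  first   example\n', '', '\tsecond example  '];
var cleaned = cleanTexts(sampleTexts);

// One statement per line is convenient for pasting into a text editor.
console.log(cleaned.join('\n'));
```

In the Chrome console you would call `cleanTexts(texts)` instead of using the sample array. Chrome DevTools also provides a `copy()` console utility, so `copy(cleanTexts(texts).join('\n'))` should place the result on the clipboard directly.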
Step 6: If that doesn't work and returns an empty array, the content may be rendered inside a "shadow DOM", which `document.querySelectorAll` does not search. In that case, you can use another, more advanced snippet that looks inside the shadow roots.
The example below works for extracting the comments on Financial Times articles:
// Get all elements in the page
let allElements = Array.from(document.querySelectorAll('*'));
// Filter for those that have a shadowRoot
let shadowRootElements = allElements.filter(e => e.shadowRoot);
// Iterate over shadowRoot elements and query for '.coral-comment-content'
let textContents = [];
shadowRootElements.forEach(element => {
  let comments = element.shadowRoot.querySelectorAll('.coral-comment-content');
  comments.forEach(comment => textContents.push(comment.textContent));
});
console.log(textContents);
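The snippet above only looks into shadow roots attached to elements of the main document. If a shadow root contains further shadow roots, a recursive variant may help. This is a sketch under that assumption, not a tested recipe for any particular site; the function name `collectTextDeep` is hypothetical, and shadow roots created in "closed" mode are not reachable this way because their `shadowRoot` property is `null`.

```javascript
// Hypothetical helper: collect text from elements matching `selector`,
// descending into every open shadow root reachable from `root`.
function collectTextDeep(root, selector, results) {
  results = results || [];
  // Matches in the current tree (the document or a shadow root).
  Array.from(root.querySelectorAll(selector)).forEach(function (el) {
    results.push(el.textContent);
  });
  // Recurse into shadow roots hosted by elements of the current tree.
  Array.from(root.querySelectorAll('*')).forEach(function (el) {
    if (el.shadowRoot) {
      collectTextDeep(el.shadowRoot, selector, results);
    }
  });
  return results;
}

// In the browser console (the class name is the one used in the snippet above):
if (typeof document !== 'undefined') {
  console.log(collectTextDeep(document, '.coral-comment-content'));
}
```

The function takes the root as a parameter, so it can also be started from a single element, for example the container of the comment section, to limit the search.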
Step 7: Once this is done and you see the text you need, copy the resulting object by right-clicking it and proceed to Stage 3 below. If something didn't work, you can use Stage 2.b or Stage 2.c to extract your data.
Stage 2.b: Using ChatGPT to Generate the Web Scraping Code for You
If the code above didn't work, you can ask ChatGPT to write the extraction code for you.
Step 5: Log in to ChatGPT and use the following prompt:
Write a plain Javascript code that would extract text from all the elements with class .comments-comment-item__main-content on an html page
Step 6: ChatGPT will generate the code for you.
The example below works for extracting comments from a LinkedIn post:
var elements = document.querySelectorAll('.comments-comment-item__main-content');
var texts = Array.from(elements).map(function(element) {
  return element.textContent;
});
console.log(texts);
Step 7: Copy that code and paste it into your browser's console. Press Enter to run it. You'll see the result. Then follow the instructions in Stage 3 to copy and paste the result into InfraNodus. If the code doesn't work, tell ChatGPT it doesn't work and ask it to fix it.
Stage 2.c: Using the Artoo Bookmarklet (Based on jQuery) to Scrape the Web Data
Step 5: Once you have identified the classes, you can use jQuery to extract all the text from those classes. For this to work, jQuery has to be loaded first. To do that, you can use the open-source Artoo bookmarklet. Just add it to your browser and then click the button to get access to jQuery.
Step 6: You might want to apply some additional functions to trim the data and separate it into sentences:
console.log($('body').find('.example').find('div[class="trg ltr"] > span[class="text"]').text().trim().replaceAll('\n','').split('.'))
Execute this line by pressing Enter.
Step 7: You'll see the resulting text in your browser's console.
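Splitting on every period, as in the line above, also breaks decimal numbers and names that contain a dot, and it drops the punctuation. Here is a rough alternative, assuming plain text where a sentence ends with `.`, `!` or `?` followed by whitespace. The function name `splitSentences` and the sample text are hypothetical.

```javascript
// Hypothetical helper: split text into sentences at ., ! or ? followed by whitespace.
function splitSentences(text) {
  return String(text)
    .replace(/\s+/g, ' ')     // collapse line breaks and repeated spaces
    .trim()
    .split(/(?<=[.!?])\s+/)   // break after ., ! or ? when whitespace follows
    .filter(function (sentence) { return sentence.length > 0; });
}

// Sample text standing in for the text extracted from the page.
var sampleText = 'He practices martial arts. It costs 3.5 euros!\nIs it hard?';
console.log(splitSentences(sampleText));
// [ 'He practices martial arts.', 'It costs 3.5 euros!', 'Is it hard?' ]
```

Because jQuery's `.text()` joins the text of all matched elements without a separator, sentences from neighboring elements can end up glued together. It may be safer to apply the function to the text of each element separately. Abbreviations followed by a space are still split, so the result is only an approximation.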
Stage 3: Getting the Extracted Data into InfraNodus
Step 8: You will see a list of all the text inside this class extracted as an "object". Point your mouse at that object, right-click (or Ctrl-click) and choose Copy Object. All the data will be copied to your clipboard.
Step 9: Open the InfraNodus text network analysis tool and paste the contents of your clipboard into the editor.
Step 10: You will get a graph visualization of the concepts that tend to be used with "martial arts" (the phrase searched for in this example) in English.
Step 11: Remove the "art" and "martial" concepts from the graph so you can see the context around them. To remove these two concepts, select them and then click the trash button at the top right (see deleting the nodes from the graph).