In an ideal world, it would be possible to export any data we find on the web in an easily comprehensible format for further processing.
However, when that is not possible, you have to get "hacky" and extract the data yourself. Mind you, this may violate a website's terms of use, so we do not encourage the approach demonstrated here. Rather, it is a technique you may find useful when a simple copy and paste doesn't work and when it is legal to use.
Your best friend in this case is the JavaScript console, which you can open via View > Developer > JavaScript Console in the Google Chrome browser. Once the console is open, you can inspect and manipulate the web page in any way you want.
For example, suppose you would like to import data from Reverso Context into InfraNodus to analyze the semantic field around a certain concept or phrase.
Reverso Context is a resource that provides examples of how words or phrases are used in specific contexts. Reverso has no API, and copying and pasting the data doesn't work well, because you get the phrases in both languages, which is not useful for analysis.
What to do instead?
Option 1: Manual Scraping Using jQuery and the Artoo Bookmarklet
Step 1: Open the Google Chrome JavaScript console (View > Developer > JavaScript Console)
Step 2: Point to the element that you're interested in
Step 3: Right-click (or Ctrl-click) and choose "Inspect" to see which DIV element you're looking at
Step 4: Find its data in the Inspector window that opens next to the console. You are looking for the class name of this particular item and of its closest "parent" (ancestor element). In this case, we are interested in the English phrase examples, which we want to import into InfraNodus to visualize the typical expressions as a graph. So we look for the class structure in which they appear: the "example", "trg ltr" and "text" classes.
Step 5: Once you have identified the classes, use jQuery to extract all the text from those classes into a variable. For this to work, jQuery has to be loaded first. To do that, you can use the open-source Artoo bookmarklet: add it to your browser and then click its button to get access to jQuery. You might want to apply some additional functions to trim the data and split it into sentences:
console.log($('body').find('.example').find('div[class="trg ltr"] > span[class="text"]').text().trim().replaceAll('\n','').split('.'))
Execute this line by pressing Enter.
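If you would rather not load jQuery, the same extraction can be sketched with the browser's built-in DOM methods. This is only a minimal sketch: the function names toSentences and extractExamples are made up for illustration, and the selector simply mirrors the classes found in Step 4, so it is an assumption about the page markup that may need adjusting. Like the one-liner above, it splits naively on full stops.

```javascript
// Pure helper: turn the raw extracted text into a clean list of sentences.
function toSentences(rawText) {
  return rawText
    .replace(/\s+/g, ' ')          // collapse newlines and repeated spaces
    .split('.')                    // naive split on full stops, as in the jQuery one-liner
    .map((s) => s.trim())          // strip whitespace around each sentence
    .filter((s) => s.length > 0);  // drop empty leftovers
}

// Browser-only part: collect the text of every matching element without jQuery.
// The selector is an assumption based on the classes identified in Step 4.
function extractExamples(root) {
  const nodes = root.querySelectorAll(
    '.example div[class="trg ltr"] > span[class="text"]'
  );
  return Array.from(nodes, (n) => n.textContent).join(' ');
}

// Only run the extraction where a page is actually available (the browser console).
if (typeof document !== 'undefined') {
  console.log(toSentences(extractExamples(document)));
}
```

Unlike the jQuery version, this one also removes empty strings and stray whitespace, so the logged list contains only non-empty sentences.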
Step 6: You will see a list of all the text inside this class extracted as an "object". Point your mouse at that object, right-click (or Ctrl-click) and choose Copy Object. All the data will be copied to your clipboard.
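As an alternative to Copy Object, the list can be joined into plain text and sent to the clipboard with copy(), a utility that Chrome DevTools provides inside the console. The sketch below is hedged accordingly: toPlainText is a hypothetical helper name, and sampleSentences is placeholder data standing in for the array logged in Step 5.

```javascript
// Join the extracted sentences into plain text, one sentence per line.
function toPlainText(sentences) {
  return sentences
    .map((s) => s.trim())  // strip stray whitespace
    .filter(Boolean)       // drop empty strings
    .join('\n');
}

// Placeholder sample; in practice this would be the array logged in Step 5.
const sampleSentences = ['First example sentence', ' Second example sentence', ''];
const plainText = toPlainText(sampleSentences);

// copy() is a Chrome DevTools console utility, so it only exists inside the console.
if (typeof copy === 'function') {
  copy(plainText);
}
```

Pasting one sentence per line avoids the brackets and quotation marks that come with a copied array.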
Step 7: Open the InfraNodus text network analysis tool and paste the contents of your clipboard into the editor.
Step 8: You will get a graph visualization of the concepts that tend to be used with "martial arts" in English.
Step 9: Remove the "art" and "martial" concepts from the graph so that you can see the surrounding context. To remove these two concepts, select them and then click the trash button at the top right (see deleting the nodes from the graph).