Friday, August 14, 2015

We talk to Allan about NewsLink

This blog post presents our new transform hub item called NewsLink that we have just released on the Transform Hub. NewsLink aims to assist in identifying and monitoring patterns in information posted on the Internet from a wide range of sources including Twitter, blog posts and news articles.

Every day millions of news articles, blog posts, Tweets, pastes, etc. are posted online. This continuous stream of information makes it difficult to identify what is important to us and should be focused on, and what can simply be ignored. One approach to picking out important information is to look at when multiple sources all mention the same people, locations, company names (and a slew of other types of entities) in a certain time period. This is the basis for NewsLink.

The image of the graph below is a small piece of a graph that was monitoring news articles related to Defcon. The snippets on the right list the news articles that mention both Samy Kamkar and Defcon on the same page. This is an example of what we will be working towards in this blog post.



This blog post is broken down into a few sections. First we'll look at the transforms that are used in NewsLink to gather your information from different sources. We'll then move on to the transforms that are used to extract entities and keywords from these web pages as well as calculate the page’s sentiment towards that topic. The last step is to automate this process with the use of Machines. Using Machines you can continuously monitor your search term and only be alerted by email when something of interest occurs.

Transforms that gather information
We have four new transforms for gathering information from different sources - two of these transforms get information from Twitter and the other two get information from websites using search engines.

Search for News Articles [using Bing]

The first transform we have is called Search for News Articles [using Bing] and is used to gather recent news articles relating to a specific search term from unspecified news sources. The transform uses Bing’s news search API and will return articles from a wide range of news websites that are indexed by the search engine. The starting point for this transform is a phrase entity where you enter your search term, as seen in the image below (1). After running the transform, a transform settings window will pop up allowing you to limit your results according to the age of the articles and their news category (2). Defcon is in the news currently so let's see some articles relating to the con that have been posted in the last 7 days. The age is given as a numerical value followed by 'd' for days, 'h' for hours, 'w' for weeks or 'm' for minutes.
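To make the age format concrete, here is a minimal Python sketch of how a value such as '7d' could be interpreted. The function name parse_max_age is ours and purely illustrative; this is not the transform's own code, just one way of reading the format described above.

```python
import re
from datetime import timedelta

# Unit letters as described above for the Bing news transform.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_max_age(value: str) -> timedelta:
    """Convert a string such as '7d' or '12h' into a timedelta.

    Hypothetical helper: a number followed by one unit letter.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([mhdw])\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised age: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


if __name__ == "__main__":
    print(parse_max_age("7d"))
    print(parse_max_age("90m"))
```

Anything that does not match the pattern (for example '7y') raises a ValueError rather than being silently guessed at.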



The next image shows the results from this search. Each entity that is returned represents a website that has posted about Defcon in the last 7 days. Clicking on one of these websites and having a look in the Detail View will provide you with more information on the article as seen below:

(Dated: 24 Jul 2015)
Search for Websites [Using GCSE]

The next new transform we have is called Search for Websites [Using GCSE] and is slightly more flexible than the former as it allows the user to specify a list of sites to search and only returns results from those sites. The transform uses a Google Custom Search Engine (GCSE) ID as a transform setting to specify the list of sites that you want it to search. To use this transform you first need to create a GCSE with the list of websites that you want to monitor. This list could really be anything from your favorite security blogs to a list of influential financial news services. Once created you will receive a unique ID for your GCSE, which is what you will use as a transform setting when running the Search for Websites [Using GCSE] transform. The example image below shows the list of websites we have included in our GCSE (1) as well as the settings that are displayed when running this transform. These settings include the GCSE ID as well as the maximum age of the pages you want returned (3). In this case the setting can be populated with 'd' for days, 'w' for weeks and 'y' for years followed by a numeric value (hours and minutes are unfortunately not supported by the API).
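For readers who want to see what such a search looks like outside of Maltego, the sketch below builds a request URL for Google's Custom Search JSON API, where the GCSE ID is passed as cx and the maximum age as dateRestrict (for example d7 for the last seven days). We are assuming here that a query of this shape resembles what the transform does behind the scenes; the function name and the placeholder ID and key are hypothetical. No request is sent, the code only constructs the URL.

```python
from urllib.parse import urlencode

# Public endpoint of Google's Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_gcse_query(term: str, gcse_id: str, api_key: str, max_age: str = "d7") -> str:
    """Return a search URL restricted to the sites in one GCSE.

    max_age uses the API's dateRestrict form: a unit letter followed
    by a number, e.g. 'd7' or 'w2'.
    """
    params = {
        "key": api_key,           # your own API key (placeholder below)
        "cx": gcse_id,            # the unique ID of your GCSE
        "q": term,                # the search term
        "dateRestrict": max_age,  # maximum age of the pages
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values: substitute your own GCSE ID and API key.
    print(build_gcse_query("Defcon", "YOUR_GCSE_ID", "YOUR_API_KEY"))
```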




The next image shows the results from this search in which you will notice that only sites included in our list were returned. By clicking one of these entities and looking in its Detail View you'll see that the different pages from the relevant websites are displayed.

(Dated: 24 Jul 2015)
Expanding websites to the actual web pages

Next we want to get all these webpages out into their own URL entities to work with them separately. To do this we run To Pages from Website which results in the graph below:

(Dated: 24 Jul 2015)
Clicking on one of these URL entities shows details of the webpage including the sentiment of the text as seen above.

Before we start running further transforms to process these articles we should speak about our Twitter transforms that can be used to get Tweets on specific topics or from specific users.

To Tweets [Search Twitter]

The first Twitter transform is called To Tweets [Search Twitter] which has actually been available in Maltego for quite some time and can be found in the PATERVA CTAS transform seed. This transform simply searches for Tweets that mention your search term. The image below shows running the transform on the hashtag Defcon with the transform slider set to 50:

24 Jul 2015

This transform is a very general search as it will search all of Twitter for Tweets made by any user. Most of the time you won’t actually be interested in what the common folk on Twitter have to say about your search term. Instead you would like to search for your topic only within a specific list of Twitter accounts.

Fortunately Twitter allows users to create lists of accounts and then search for Tweets by users in these lists. You can create your own lists of Twitter users from your Twitter profile and then access that list in Maltego by finding your Twitter profile and running the transform To User Lists [That this person owns].  Paul's Twitter account contains a public list of Twitter accounts belonging to news sites that he believes to be quite influential/popular. To find this list you will first need to find his Twitter account which can be done by searching for his alias and running the transform To User Lists [That this person owns] to see his lists. From the user list entity you can see which Twitter accounts are included in the list by running the transform To Twitter Affiliation. The image below shows the steps to get the list and the users in the list:




To Tweet [Written by user list member]

Next up we want to monitor this list of accounts and return Tweets to our graph whenever our search term is mentioned by any one of these users. To do this we run the transform To Tweet [Written by user list member] on the user list entity (1). A transform settings window will pop up allowing you to specify your search term as well as a term to ignore Tweets by (see 2). You can also specify the maximum age of the Tweets that you want to be returned. This is entered in seconds in the first transform setting field as shown in the image below:



The search above results in only ten Tweets by users in Paul's list that mentioned Greece in the last week (604 800 seconds) (3). You can see the details of each Tweet by having a look in the Detail View.

If you didn't want to search a specific topic but instead wanted all the Tweets by the users in your list, you could run the same transform leaving the two transform settings, Tweets that don't contain and Tweets that contain, blank. This will return all the users' Tweets in the specified time.
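The three settings described above (maximum age in seconds, a term the Tweet must contain and a term it must not contain) amount to a simple filter. The Python sketch below shows that logic on a plain list of (timestamp, text) pairs; the data shape and the function name are our own assumptions for illustration and are not taken from the transform itself.

```python
from datetime import datetime, timedelta, timezone

# One week expressed in seconds, as used in the example above.
ONE_WEEK = int(timedelta(weeks=1).total_seconds())  # 604800


def filter_tweets(tweets, max_age_seconds, contains="", excludes="", now=None):
    """Keep Tweets that are recent enough and match the two text settings.

    tweets: iterable of (created_at, text) tuples (hypothetical shape).
    Leaving contains and excludes blank keeps every recent Tweet.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=max_age_seconds)
    kept = []
    for created_at, text in tweets:
        lowered = text.lower()
        if created_at < cutoff:
            continue  # older than the maximum age
        if contains and contains.lower() not in lowered:
            continue  # search term not mentioned
        if excludes and excludes.lower() in lowered:
            continue  # contains the term we want to ignore
        kept.append((created_at, text))
    return kept
```

Calling filter_tweets(tweets, ONE_WEEK) with both text settings left blank mirrors the "return everything from the list" case described above.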

These four transforms are what we use to gather our information from the web and from Twitter, with two of them allowing you to get results from very specific sources (eg: from your Twitter lists or from your GCSE) and the other two allowing you to get results from a wide range of sources (eg: all the users on Twitter or all pages indexed by Bing's news search). The table below summarizes how these four transforms can be categorized:


Processing the information we've mined
Now that we have our information collected we can do some interesting operations on the data to find where different sources are mentioning a common entity (like a person's name... and then some). We will also look at the sentiment across the different sources on an entity to determine that entity's “average sentiment”.

Let's return to our previous Defcon graph where we got related news articles by running the transform Search for News Articles [using Bing]. We've run the transform To Pages from Website to get the different news articles out into their own URL (webpage) entities.

From here there are a few options for transforms to run on these URL entities. The first transform is called To Related Words with Sentiment and is used to extract uncommon words from webpages. The words need to be within a certain distance of your search term in order for them to be returned. This distance between the extracted word and our search term is specified in a transform setting. This same transform can also be run on our Tweet entities although you won't need to specify a sentence distance as the transform will look at the entire Tweet. There are two other settings for this transform which are used to specify a list of words to ignore and another to specify a list of words that should always be returned if found on the webpage or in the Tweet.
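To illustrate the idea of a sentence distance together with the ignore list and the "always return" list, here is a rough Python sketch. It is only an approximation of the behaviour described above: the post does not say how the transform decides that a word is uncommon, so a tiny stop-word list stands in for that step, the sentence splitting is deliberately naive, and all names are hypothetical.

```python
import re

# Stand-in for "common" words; the real criterion is not described in the post.
STOP_WORDS = frozenset(
    "a an and are as at by for in is it of on the to was with".split()
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def related_words(text, search_term, max_distance=1,
                  ignore_words=(), through_words=()):
    """Return words found within max_distance sentences of search_term.

    Words in through_words are returned wherever they appear in the text.
    """
    sentences = split_sentences(text)
    term = search_term.lower()
    hits = [i for i, s in enumerate(sentences) if term in s.lower()]
    ignore = {w.lower() for w in ignore_words} | {term}
    through = {w.lower() for w in through_words}
    found = set()
    for i, sentence in enumerate(sentences):
        near = any(abs(i - h) <= max_distance for h in hits)
        for word in re.findall(r"[a-z0-9']+", sentence.lower()):
            if word in through:
                found.add(word)  # always returned if present
            elif near and word not in STOP_WORDS and word not in ignore:
                found.add(word)
    return found
```

With a distance of 0 only words in the same sentence as the search term are returned; increasing the distance widens the window one sentence at a time in both directions.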

The next transform we have for processing our information is called To Entities with Sentiment and uses Named Entity Recognition (NER) to identify different entities that are mentioned anywhere on the webpage. The transform will look for things such as people's names, company names, countries, cities, etc. It will also extract the sentiment of that entity and return it in its Detail View. This same transform can be run on our Tweet entities too.

If you want to be more specific and only return entities that are found within a certain sentence distance of your search term, you can use the set of transforms To Related Entities. These transforms take in a transform setting that specifies a maximum sentence proximity between the found entity and your original search term on the page, which reduces the number of irrelevant results that are returned to your graph. Running the To Related Entities transform set on our URLs that mention the term ‘Defcon’ and specifying a maximum sentence proximity of "1" results in the graph below.

All the nodes at the bottom of the tree are entities extracted from the various webpages and appear within 1 sentence of the word ‘Defcon’ on our page.


Viewing this type of information in the Main View is not ideal as it is very difficult to see where multiple pages link to the same entity which is what we are looking for. The next image is of the same graph but in Bubble View using the new DiverseSentiment Viewlet (included in this post, but not available by default - please install manually). This viewlet will be explained next:



In this view entities are sized according to how many incoming links they have making it easier to identify entities that are mentioned across multiple news sources. Entities relating to a common topic will also cluster together on your graph. For the NewsLink Hub Item we created a new Viewlet called DiverseSentiment which colours nodes on the graph according to their average sentiment - the more red the entity is the more negative it is and the greener it is the more positive.

The sentiment for an entity is calculated by taking each sentence that the entity was mentioned in from the various different sources and then averaging the sentiment across all the articles. To calculate this sentiment we use a great service from [AlchemyAPI] which gets the targeted sentiment of each entity in each information source. The image below shows an entity from this search in more detail. It has quite a negative "average sentiment" from the three articles it was mentioned in (this graph was created on the 24 Jul 2015):

(Dated: 24 Jul 2015)
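The two quantities that drive this view, the number of incoming links and the average sentiment, are easy to sketch in Python. In the example below we assume sentiment scores between -1 (negative) and 1 (positive) and we average over every mention; both are assumptions made for illustration. The colour mapping is likewise only a guess at the general idea of "redder is more negative, greener is more positive", not the Viewlet's actual formula.

```python
from collections import defaultdict


def summarise(mentions):
    """Aggregate (source_url, entity, sentiment) tuples per entity.

    Returns the number of distinct sources mentioning each entity
    and the mean of its sentiment scores.
    """
    sources = defaultdict(set)
    scores = defaultdict(list)
    for source, entity, sentiment in mentions:
        sources[entity].add(source)
        scores[entity].append(sentiment)
    return {
        entity: {
            "incoming_links": len(sources[entity]),
            "average_sentiment": sum(scores[entity]) / len(scores[entity]),
        }
        for entity in sources
    }


def sentiment_colour(score):
    """Map a score in [-1, 1] to an (r, g, b) tuple.

    -1 gives pure red, 0 gives white and 1 gives pure green.
    """
    score = max(-1.0, min(1.0, score))
    if score < 0:
        fade = round(255 * (1 + score))
        return (255, fade, fade)
    fade = round(255 * (1 - score))
    return (fade, 255, fade)
```

An entity mentioned negatively by three different articles would therefore get three incoming links and a reddish colour.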

Automating the process with machines
So far what we have done has been a manual process but what we really want is to build a machine that automatically fetches information from various sources every [n] minutes, runs our word processing transforms on the data and then only alerts us when anything interesting happens on our graph by sending us an email, bookmarking the entity or performing some other action to alert the user.

For each of these new information gathering transforms we have a new perpetual Machine that automates the process of running the transforms and can be used to continuously monitor websites for activity. Each Machine is essentially broken down into three phases. Initially your information is collected with one of the "information gathering" transforms discussed earlier. Transforms are then run to pull out related entities and uncommon words that are mentioned on the webpage in close proximity to your search term. The last phase of the Machine deletes old entities from your graph that are outside your monitor's time window and then sets up email alerts for when a new topic is being mentioned by multiple sources.

Another new transform we have is called Email Alert Message which takes in an email address (or list of email addresses) as a transform setting and sends an email alert message to those addresses when the transform is run. This transform is used in our new Machines to alert the user when a specific event happens on their graph. By default the email alerts are commented out in the Machine scripts.

The machines also use different coloured bookmarks to indicate which iteration of the Machine an entity was returned in - red bookmarks indicate that the entity was returned in the most recent iteration, orange for the previous iteration and so on.

The names and descriptions of the four machines are below:

  • General News Source Monitor - This machine will search for news articles relating to a certain topic using the Search for News Articles [using Bing] transform. It will then run the language processing transforms on the results to extract related words and entities.
  • GCSE Term Monitor - This machine uses the transform Search for Websites [Using GCSE] to search a list of websites for a specific term. It will then run the language processing transforms on the results to extract related words and entities.
  • Twitter Monitor V2 - This machine will start by searching a specific phrase on Twitter and then extract entities found in the Tweet, uncommon words, hashtags, links and Twitter handles.
  • Twitter List Monitor - This machine is similar to the former; however, it will only return Tweets from a specific list of Twitter users by using the transform To Tweet [Written by user list member].

Opening up the script for any of these new Machines you will see at the top there are a couple of variables you can configure for your monitor which are explained below:
  • incoming_link_count - This variable specifies how many incoming links an entity will need before an email alert is sent or before the entity is bookmarked.
  • ignore_words - This is a comma separated list of words/entities that you want the transforms to ignore in results. For instance if you were monitoring Defcon you wouldn't want to be alerted every time terms like 'BlackHat', 'hacker' or 'Las Vegas' were mentioned close to your search term. You can achieve this by including these in your ignore list.
  • through_words - There are some words that you will always want to have returned if they are mentioned close to your search term somewhere on the web. These words should be included in the through_words list. For instance if you were monitoring a stock you could include the words 'buy', 'sell' or 'hold' in the through_words list.
  • timer - This specifies the time between iterations of your machine and is measured in seconds.
  • max_age - This specifies the maximum age an entity can be on your graph before it is deleted.
  • email_address - An email alert will be sent to this address when an alert is triggered. 
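
To show how these variables fit together, here is a small Python sketch of the last phase of a monitor: dropping entities that have aged out and deciding which ones deserve an alert. It is not the Machine script itself. The default values, the graph structure and the function name are hypothetical, and we assume max_age is measured in seconds like timer, which the list above does not state.

```python
from dataclasses import dataclass


@dataclass
class MonitorConfig:
    """Mirror of the script variables above; the defaults are made up."""
    incoming_link_count: int = 3
    ignore_words: tuple = ()
    through_words: tuple = ()
    timer: int = 600        # seconds between iterations
    max_age: int = 86400    # assumed to be in seconds as well
    email_address: str = ""


def prune_and_alert(graph, config, now):
    """Drop entities older than max_age and list those worth an alert.

    graph: dict mapping an entity name to
           {"first_seen": seconds, "sources": set of source URLs}
    Returns (remaining graph, sorted list of entity names to alert on).
    """
    fresh = {
        name: node
        for name, node in graph.items()
        if now - node["first_seen"] <= config.max_age
    }
    ignore = {w.lower() for w in config.ignore_words}
    alerts = sorted(
        name
        for name, node in fresh.items()
        if len(node["sources"]) >= config.incoming_link_count
        and name.lower() not in ignore
    )
    return fresh, alerts
```

In a real monitor this step would run once per iteration, every timer seconds, and the returned names would be bookmarked or sent to email_address.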

One last note about DiverseSentiment: the new Viewlet won't be downloaded when you install the NewsLink hub item but you can get it here (http://www.paterva.com/SentimentViewlet.mtz) and manually import it into your Maltego client.

NewsLink aims to provide a flexible way of monitoring news, websites and Tweets and then alerting the user to what is most important by identifying where multiple sources are mentioning the same words or entities.

As always, enjoy responsibly,
PR

Transforms reference

For gathering information:

  • Search for News Articles [using Bing] - This transform will search for news articles that are indexed by Bing relating to a specific topic. The transform has two transform settings: one for specifying the maximum age of articles that should be returned and one for specifying the news category of the results. The age of the articles should start with a numeric value and be followed by either 'm' for minutes, 'h' for hours, 'd' for days or 'w' for weeks.
  • Search for Websites [Using GCSE] - This transform will search for a specific term on a custom list of websites specified in a Google Custom Search Engine (GCSE). The transform has three transform settings: one to specify the age of results, one to specify your GCSE and another to specify whether or not pages without a publish date should be returned. The maximum page age should begin with either a 'd' for days, an 'm' for months or a 'y' for years followed by a numeric value.
  • To Tweets [Search Twitter] - This transform searches Twitter for a specific phrase.
  • To Tweet [Written by user list member] - This transform returns Tweets from a specific list of Twitter users. It has three transform settings: one to specify the age of the Tweet (in seconds), one to specify a search word and one to specify words to ignore Tweets by.

For extracting information:

  • To Entities with Sentiment - This transform will return all entities found on the entire page including each entity's targeted sentiment. [This transform can be run on URLs and on Tweet entities].
  • To Related Entities - This is a transform set with transforms that will only return entities found in a specific sentence proximity to your search term. This sentence proximity is specified in a transform setting. The transforms in this set include: To Related Companies, To Related Countries, To Related Cities, To Related People, To Related Financial Market Index, To Related States Or Counties, To Related Organizations, To Related Technologies and To Related Field Terminology.
  • To Related Words with Sentiment - This transform will look for uncommon words that are mentioned in close proximity to your search term. [This transform can be run on URLs and Tweet entities].

For alerting the user:

  • SendEmailAlert - This transform will alert the user by sending an email when multiple sites point to the same term.
