Using the Domain Discovery Tool

Now you should be able to head to http://<hostname>:8084/ to interact with the tool.

Create Domain

[Figure: Domains page]

Begin by adding a domain on the Domains page (the initial page), shown in the figure above, by clicking on the add_domain button. A domain maintains the context of domain discovery.

[Figure: Adding a domain dialog]

In the Adding a domain dialog shown in the figure above, enter the name of the domain you would like to create, for example Ebola, and click the Submit button. You should now see the new domain in the list of domains, as shown below.

[Figure: List of domains showing the new domain]

Once the domain is added, click on its name in the list of domains to collect, analyse and annotate web pages.

Domains can be deleted by clicking on the delete_domain button.

[Figure: Deleting a domain dialog]

In the Deleting a domain dialog, select the domains to be deleted from the list of current domains and click the Submit button. They will no longer appear in the list of domains.

NOTE: This will delete all the data collected for that domain.

Acquire Data

Continuing with our example of the Ebola domain, we show here the methods of adding data to the domain. Expand the Search tab on the left panel. You can add data in the following ways:

Upload URLs

If you already have a set of URLs of known sites, you can add them from the LOAD tab. You can enter the list of URLs in the text box, as shown in the figure below:

[Figure: LOAD tab with the URL text box]

Enter one URL per line.

You can also upload a file containing the list of URLs by clicking on the LOAD URLS FROM FILE button. This will bring up a file explorer window where you can select the file to upload. The file should list the URLs one per line. An example URL list file for the Ebola domain can be downloaded HERE. The uploaded URLs are listed in the Filters Tab under Queries as Uploaded.
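Before uploading, it can help to check that a file really has one usable URL per line. The following is a minimal sketch, not part of the tool: the function name parse_url_list and the rule that only absolute http or https URLs count as valid are assumptions made for illustration, and the tool itself may accept other forms.

```python
from urllib.parse import urlparse


def parse_url_list(text):
    """Split `text` into (valid, rejected) lists, reading one URL per line.

    Blank lines and surrounding whitespace are ignored. A line counts as
    valid only if it parses as an absolute http or https URL (an assumption
    of this sketch, not a documented rule of the tool).
    """
    valid, rejected = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        parts = urlparse(line)
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(line)
        else:
            rejected.append(line)
    return valid, rejected


sample = """https://ebolaresponse.un.org/data

not a url
http://example.org/ebola
"""
valid, rejected = parse_url_list(sample)
print("valid:", valid)
print("rejected:", rejected)
```

Lines reported as rejected can be corrected in the file before it is loaded.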

SeedFinder

Instead of making multiple queries to Google/Bing yourself, you can use the SeedFinder to trigger an automated keyword search on Google/Bing and collect more web pages for the domain. This requires a domain model, so you can use the SeedFinder only once you have annotated sufficient pages, as indicated by a non-zero accuracy in the top right corner.

To start a SeedFinder search click on the SEEDFINDER tab.

[Figure: SEEDFINDER tab with the initial search query]

Enter the initial search query keywords, for example ebola treatment, as shown in the figure above. The SeedFinder issues this query to Google/Bing and applies the domain model to the pages returned. From the pages labeled relevant by the domain model, the SeedFinder extracts keywords to form new queries, which it again issues to Google/Bing. This iterative process terminates when no more relevant pages are retrieved or the configured maximum number of queries is exceeded.
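The iterative process described above can be sketched as a small loop. This is an illustration under stated assumptions, not the SeedFinder implementation: search and is_relevant are hypothetical stand-ins for the Google/Bing query and the domain model, keyword extraction is reduced to picking the most frequent words, and stopping when a query repeats is an extra safeguard added here.

```python
from collections import Counter


def seed_finder(initial_query, search, is_relevant, max_queries=5, terms_per_query=2):
    """Iteratively query a search backend and collect relevant pages.

    search(query) -> list of (url, text) pairs   (hypothetical stand-in)
    is_relevant(text) -> bool                     (stand-in for the domain model)
    Returns (relevant_pages, issued_queries).
    """
    relevant_pages = {}  # url -> text
    issued = []          # queries already sent
    query = initial_query
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        new_pages = [(url, text) for url, text in search(query)
                     if url not in relevant_pages and is_relevant(text)]
        if not new_pages:
            break  # no more relevant pages retrieved
        relevant_pages.update(new_pages)
        # Form the next query from the most frequent words of the new relevant pages.
        counts = Counter(w for _, text in new_pages for w in text.lower().split())
        candidate = " ".join(w for w, _ in counts.most_common(terms_per_query))
        query = candidate if candidate not in issued else None
    return relevant_pages, issued


# Toy search index used only for demonstration.
FAKE_INDEX = {
    "ebola treatment": [("u1", "ebola vaccine trial"), ("u2", "football scores")],
    "ebola vaccine": [("u3", "ebola outbreak response")],
}
pages, queries = seed_finder(
    "ebola treatment",
    search=lambda q: FAKE_INDEX.get(q, []),
    is_relevant=lambda text: "ebola" in text,
)
print(sorted(pages))
print(queries)
```

The max_queries parameter plays the role of the configured query limit; the real tool extracts keywords with its own method rather than raw word counts.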

You can monitor the status of the SeedFinder in the Process Monitor, which can be accessed by clicking on the pm_icon at the top, as shown below:

[Figure: Process Monitor]

You can also stop the SeedFinder process from the Process Monitor by clicking on the stop button shown next to the corresponding process.

All queries made are listed in the Filters Tab under SeedFinder Queries. These pages can now be analysed and annotated just like the other web pages.

Explore Data (Filters)

[Figure: Filters tab]

Once some pages are loaded into the domain, they can be analyzed and sliced with the various filters available in the Filters tab on the left panel. The available filters are:

Queries

This lists all the web search queries, uploaded URLs and SeedFinder queries made to date in the domain. You can select one or more of these queries to get the pages for those specific queries.

SeedFinder Queries

This lists all the SeedFinder queries made to date in the domain. You can select one or more of these queries to get the pages for those specific queries.

Crawled Data

This lists the relevant and irrelevant crawled data. The relevant crawled data, CD Relevant, are those crawled pages that are labeled relevant by the domain model. The irrelevant crawled data, CD Irrelevant, are those crawled pages that are labeled irrelevant by the domain model.

Tags

This lists the annotations made to data. Currently the annotations can be either Relevant, Irrelevant or Neutral.

Annotated Terms

This lists all the terms that were either added or uploaded in the Terms Tab. It also lists those extracted terms in the Terms Tab that have been annotated.

Domains

This lists the top-level domains of all the pages in the domain. Here, the top-level domain of a page means the host name of its URL. For example, the top-level domain for the URL https://ebolaresponse.un.org/data is ebolaresponse.un.org.
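To reproduce this grouping outside the tool, the host name can be read from each URL with the Python standard library. This is a small sketch; page_domain is a helper name chosen for illustration, and the second and third URLs are made-up examples.

```python
from collections import Counter
from urllib.parse import urlparse


def page_domain(url):
    """Return the host name of `url`, or None if the URL has no host part."""
    return urlparse(url).hostname


urls = [
    "https://ebolaresponse.un.org/data",
    "https://ebolaresponse.un.org/liberia",   # illustrative URL
    "http://example.org/ebola",               # illustrative URL
]
# Count how many pages fall under each host name.
pages_per_domain = Counter(page_domain(u) for u in urls)
for domain, count in pages_per_domain.most_common():
    print(domain, count)
```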

Model Tags

You can expand the Model Tags and click the Update Model Tags button that appears below to apply the domain model to a random selection of 500 unlabeled pages. The predicted labels for these 500 pages can be:

  • Maybe Relevant: These are pages that have been labeled relevant by the model with high confidence.
  • Maybe Irrelevant: These are pages that have been labeled irrelevant by the model with high confidence.
  • Unsure: These are pages that were marked relevant or irrelevant by the domain model, but with low confidence. Experiments have shown that labeling these pages helps improve the domain model’s ability to predict labels for similar pages with higher confidence.

NOTE: This will take a few seconds to apply the model and show the results.
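The three Model Tags can be pictured as confidence buckets over the model's predictions. The sketch below is an assumption-laden illustration: the 0.8 threshold, the function names and the use of a single probability of relevance are all hypothetical, since the tool's actual cutoffs are not documented here.

```python
import random


def pick_sample(unlabeled, k=500, seed=None):
    """Return a random selection of at most k unlabeled pages."""
    rng = random.Random(seed)
    return rng.sample(unlabeled, min(k, len(unlabeled)))


def model_tag(prob_relevant, threshold=0.8):
    """Map a predicted probability of relevance to a Model Tags bucket.

    `threshold` is a hypothetical confidence cutoff, not a documented value.
    """
    if prob_relevant >= threshold:
        return "Maybe Relevant"
    if prob_relevant <= 1.0 - threshold:
        return "Maybe Irrelevant"
    return "Unsure"


for p in (0.95, 0.55, 0.05):
    print(p, model_tag(p))
```

Pages that land in the Unsure bucket are the ones worth annotating first, for the reason given in the list above.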

Extracted Terms Summary

[Figure: Terms tab]

The most relevant terms and phrases (unigrams, bigrams and trigrams) are extracted from the pages in the current view of DDT and listed in the Terms Tab on the left panel, as shown in the figure above. This provides a summary of the pages currently in view. Initially, when there are no annotated terms, the 40 terms with the highest TF-IDF (term frequency-inverse document frequency) score are selected. The terms are displayed with their frequency of occurrence in relevant (blue) and irrelevant (red) pages, shown as bars to the right of the Terms panel. This helps the expert select terms that best discriminate relevant pages.
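As a rough illustration of how such a ranking can be computed, the sketch below scores unigrams, bigrams and trigrams by TF-IDF using only the standard library. It is not DDT's extraction code: whitespace tokenization, the smoothed IDF formula and summing scores across pages are simplifying assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Yield all unigrams, bigrams and trigrams of a token list as strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_terms(pages, k=40):
    """Return the k terms with the highest TF-IDF score summed over `pages`."""
    docs = [Counter(ngrams(text.lower().split())) for text in pages]
    # Number of pages in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    scores = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            tf = count / total
            # The +1 keeps terms that occur in every page from scoring zero.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(k)]


pages = [
    "ebola virus outbreak",
    "ebola virus treatment",
    "football match",
]
print(top_terms(pages, k=5))
```

A production extractor would also remove stop words and normalize punctuation, which this sketch leaves out.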

Terms can be tagged as ’Positive’ and ’Negative’ with one click and two clicks respectively. The tags are stored in the active data source. When the update terms button is clicked, the positively and negatively annotated terms are used to re-rank the other terms. Terms help the expert understand and discover new information about the domains of interest. The terms can be used to refine the Web search or to start new sub-topic searches.

Custom relevant and irrelevant terms can be added by clicking the + button to boost the extraction of more relevant terms. These custom terms are distinguished by the delete icon before them, which can be clicked to delete the custom term.

Hovering the mouse over the terms in the Terms window displays the context in which they appear on the pages. This again helps the expert understand and disambiguate the relevant terms. Inspect the terms extracted in the “Terms” window. Clicking on the stop button pins the context to the corresponding term.

Create Model

DDT incrementally builds a model as the user annotates the retrieved pages. The accuracy of the domain model is displayed on the top right corner. It provides an indication of the model coverage of the domain and how it is influenced by annotations.

The domain model can be exported by clicking on the Model button at the top (this button is disabled when there are not enough annotations to build the model, in which case the display reads Accuracy of onlineClassifier: 0 %). This will show a drop-down, as shown in the figure below:

[Figure: Model drop-down]

Click on Create Model to export the model. This should bring up a file explorer pop-up (make sure pop-ups are enabled in your browser), as shown below. Save the compressed model file.

[Figure: File explorer pop-up for saving the model]

This saved model file contains the ACHE classifier model, the training data for the model and the initial seed list required for focused crawling, as shown in the figure below:
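If you want to check what a saved model file contains without unpacking it, its members can be listed programmatically. The sketch below assumes the compressed file is a ZIP archive, which is not confirmed by this guide, and the member names in the demonstration archive are placeholders rather than the names the tool actually writes.

```python
import io
import zipfile


def list_model_archive(fileobj):
    """Return the sorted member names of a ZIP archive (path or file object)."""
    with zipfile.ZipFile(fileobj) as archive:
        return sorted(archive.namelist())


# Build a tiny in-memory archive with placeholder member names for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("model/classifier.bin", b"")                 # placeholder model file
    demo.writestr("training_data/page1.html", b"<html></html>")  # placeholder training page
    demo.writestr("seeds.txt", "https://ebolaresponse.un.org/data\n")
buffer.seek(0)

for name in list_model_archive(buffer):
    print(name)
```

For a real export, pass the path of the saved file to list_model_archive instead of the in-memory buffer.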

[Figure: Contents of the saved model file]

Annotation

Currently, pages can be annotated as Relevant, Irrelevant or Neutral. The tag_all buttons tag all pages in the current view with the respective annotation, and the tag_one buttons tag individual pages. Annotations are used to build the domain model.

Note:

  • At least 10 relevant and 10 irrelevant pages should be annotated to build the model. More annotations give better coverage of the domain and therefore a better domain model.
  • Ensure that the relevant and irrelevant page annotations are balanced for a better model.
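The two notes above amount to a simple check on the annotation counts. The sketch below is illustrative only: the minimum of 10 comes from the note, while the max_ratio value used to decide what counts as balanced is a hypothetical choice, not a rule of the tool.

```python
def annotation_status(n_relevant, n_irrelevant, minimum=10, max_ratio=2.0):
    """Classify annotation counts as 'too few', 'unbalanced' or 'ok'.

    `max_ratio` (largest class divided by smallest class) is a hypothetical
    balance criterion chosen for this sketch.
    """
    if n_relevant < minimum or n_irrelevant < minimum:
        return "too few"
    ratio = max(n_relevant, n_irrelevant) / min(n_relevant, n_irrelevant)
    if ratio > max_ratio:
        return "unbalanced"
    return "ok"


for counts in [(5, 20), (10, 50), (12, 15)]:
    print(counts, annotation_status(*counts))
```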

Run Crawler

Once a sufficiently good model is available, or pages have been uploaded for a deep crawl, you can change from the Explore Data View to the Crawler View shown below:

[Figure: Deep Crawl View]

The Crawler View supports both a deep crawl and a focused crawl. The figure above shows the Deep Crawl View. The list on the left shows all pages annotated as Deep Crawl in the Explore Data View. The table on the right shows recommendations of pages that could be added to the deep crawl by clicking on ‘Add to Deep Crawl’. If keyword terms have been added or annotated, recommendations are scored by how many of the keywords they contain. Otherwise, the domains are recommended by the number of pages they contain.
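The two recommendation rules can be sketched as a single scoring function. This is a simplified illustration, not the tool's ranking code: recommend_domains is a hypothetical name, pages are assumed to be (url, text) pairs, and keyword matching is a plain case-insensitive substring test.

```python
from collections import Counter
from urllib.parse import urlparse


def recommend_domains(pages, keywords=()):
    """Rank host names for a deep crawl.

    pages: iterable of (url, text) pairs.
    With keywords, a domain's score is the number of keyword hits on its pages;
    without keywords, it is the number of pages the domain contains.
    """
    lowered_keywords = [k.lower() for k in keywords]
    scores = Counter()
    for url, text in pages:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip entries that are not absolute URLs
        if lowered_keywords:
            lowered_text = text.lower()
            scores[host] += sum(1 for k in lowered_keywords if k in lowered_text)
        else:
            scores[host] += 1
    return scores.most_common()


pages = [
    ("http://a.org/1", "ebola news"),
    ("http://a.org/2", "sports"),
    ("http://a.org/3", "weather"),
    ("http://b.org/1", "ebola vaccine treatment"),
]
print(recommend_domains(pages))
print(recommend_domains(pages, keywords=["ebola", "vaccine", "treatment"]))
```

In the toy data, a.org ranks first by page count, while b.org ranks first once keywords are supplied.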

The ACHE deep crawler can be started by clicking on the “Start Crawler” button at the bottom. This starts a deep crawl of all the pages tagged for Deep Crawl.

You can see the results of the crawled data in “Crawled Data” in the Filters Tab. When the crawler is running it can be monitored by clicking on the ‘Crawler Monitor’ button.

The figure below shows the Focused Crawler View:

[Figure: Focused Crawler View]

First, in the ‘Model Settings’ on the left, select the tags that should be considered relevant (Positive) and irrelevant (Negative). If there are sufficient relevant and irrelevant pages (about 100 each), you can start the crawler by clicking on the Start Crawler button. If there are no irrelevant pages, a page classifier model cannot be built. Instead, you can either upload keywords by clicking on ‘Add Terms’ in the Terms window or annotate the terms extracted from the positive pages by clicking on them. If no annotated terms are available, the top 50 terms are used to build a regular expression model.
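The choice between the two model types, and the idea of a regular expression model built from terms, can be sketched as follows. This is only an illustration: the function names are hypothetical, the pattern format is not the one ACHE uses internally, and the 100-page minimum is taken from the approximate figure quoted above.

```python
import re


def choose_model(n_relevant, n_irrelevant, minimum=100):
    """Return 'classifier' when both classes have enough pages, else 'regex'."""
    if n_relevant >= minimum and n_irrelevant >= minimum:
        return "classifier"
    return "regex"


def build_term_regex(terms, limit=50):
    """Compile a case-insensitive pattern matching any of the first `limit` terms."""
    chosen = [t for t in terms[:limit] if t]
    if not chosen:
        raise ValueError("at least one term is required")
    # Try longer terms first so that phrases win over their own prefixes.
    chosen.sort(key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in chosen)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


print(choose_model(n_relevant=120, n_irrelevant=0))
pattern = build_term_regex(["ebola", "ebola virus", "outbreak"])
print(bool(pattern.search("New Ebola virus cases reported")))
print(bool(pattern.search("Football results")))
```

A page would then be treated as relevant when the pattern matches its text, which is a much coarser test than a trained page classifier.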

Once either a page classifier or a regex model can be built, start the focused crawler by clicking on the Start Crawler button.

You can see the results of the crawled data in “Crawled Data” in the Filters Tab. When the crawler is running it can be monitored by clicking on the ‘Crawler Monitor’ button.

The Model info on the bottom right shows how good a domain model is if there are both relevant and irrelevant pages annotated. The color bar shows the strength of the model based on the balance of relevant and irrelevant pages and the classifier accuracy of the model.