Getting information from the internet into a chatbot’s output can be very useful. It lets a bot show continuously changing values, like weather information, and it can potentially also be used for learning, although the latter is obviously a little trickier.

Retrieving, or scraping, info from the internet can be done remarkably easily with the chatbot designer. Here’s a screencast of a bot that retrieves weather information from the Google weather API.

In the video, a .net plug-in is used to retrieve information from the internet by means of XPaths. This plug-in is included by default in the application. Note though that plug-ins are only supported in the pro version. Basic users will be able to use these projects, but they can’t create or edit any patterns that rely on plug-ins. Also, plug-ins are loaded on a project-by-project basis, so if you want to use the scraping features in your own project, you will first need to make certain that the correct .net functions have been loaded. Once this has been set up though, all plug-ins will automatically be loaded when the project is opened.

Loading

To load a plug-in, go to view/communication channels/OS. This will bring up a view like the one on the right. From here, you can load and unload dlls, classes and functions.

First up is the dll. This can be loaded with one of the buttons on the toolbar. The first one gives access to the cache (dlls that have already been loaded). With the next button, you can select a file from disk. Note that, even though the ‘CmdShell.dll’ file (which contains the scraping functions) is part of the installation, it isn’t guaranteed to be in the cache already, so you might have to select it from the ‘program files/Chatbot designer pro/’ path. By the way, you can remove a dll by selecting it and pressing delete.

Functions can be selected/deselected with the checkbox in front of the name. You can alternatively (de)select the entire class or lib at once. Notice the blue label behind each function name: this is the name that you can use in the patterns. The do-patterns evaluator has no knowledge whatsoever of namespaces, classes or functions; it just knows a single name. This means that all function names must be unique within a single project. If you try to enter a duplicate name, a red box will be displayed around the newly mapped name.

There are quite a few functions available for scraping. Basically though, there are 3 groups: some functions to open/close web-pages, some functions to get data from those opened pages and finally the same functions that don’t require you to first open/close any files but which can do a scrape directly.

Short scrapes

Depending on how much data you need to retrieve, you can use one or the other technique. If there is only 1 xpath that you have to run on a page, then you can probably best use the short/direct functions that don’t require you to first open the web-page. Instead the address is supplied as an argument, together with the xpath. Here’s a list of the available quick scrapers:

Name         | Arg 1            | Arg 2 | Result
ScrapeText   | file or web path | XPath | 0, 1 or more text values
ScrapeInt    | file or web path | XPath | 0, 1 or more int values
ScrapeDouble | file or web path | XPath | 0, 1 or more floating point values
ScrapeDate   | file or web path | XPath | 0, 1 or more dates

And a short usage example to get the temperature info from the google API for a city that’s defined in ‘$place’:

$value = ScrapeText("http://www.google.com/ig/api?weather=$place:interleaf(+)", "/xml_api_reply/weather/current_conditions/temp_c/@data")

As you can see, the first argument specifies the web-page to open. The second is an xpath to the data attribute of the ‘temp_c’ element. Note that we use ‘:interleaf(+)’ because the Google API expects city names that contain multiple words to be separated with a ‘+’, like: New+York.
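To make the two steps in that call a bit more tangible, here is a minimal Python sketch of the same idea. It is not how the plug-in is implemented. The helper names (`interleaf`, `build_url`, `scrape_text`) and the sample reply are assumptions made up for illustration, and the reply is hard-coded so that nothing is fetched from the network. Python’s ElementTree has no ‘/@data’ step, so the attribute is read separately.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample reply, shaped after the XPath used in the pattern.
SAMPLE_REPLY = """<xml_api_reply version="1">
  <weather>
    <current_conditions>
      <condition data="Clear"/>
      <temp_c data="21"/>
    </current_conditions>
  </weather>
</xml_api_reply>"""


def interleaf(words, separator):
    """Join the words of a value with a separator, like ':interleaf(+)'."""
    return separator.join(words.split())


def build_url(place):
    # Same address as in the pattern, with the city words joined by '+'.
    return "http://www.google.com/ig/api?weather=" + interleaf(place, "+")


def scrape_text(xml_text, element_path, attribute):
    """Return the attribute value of every element matching element_path."""
    root = ET.fromstring(xml_text)
    # The path is relative to the root element (xml_api_reply).
    return [node.get(attribute) for node in root.findall(element_path)
            if node.get(attribute) is not None]


url = build_url("New York")
values = scrape_text(SAMPLE_REPLY, "./weather/current_conditions/temp_c", "data")
print(url)
print(values)
```

Like the scrape functions in the table, the sketch returns a list, so a path that matches nothing simply gives 0 values.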

More scraping

The second scraping method is primarily useful if you need to run multiple xpaths on the same content. In this case, it’s far more economical to first retrieve the page, run all the queries on the cached file and finally, when done, release it again. This can be accomplished with the remaining scrape functions.

You open a file or webpage with either ‘OpenScraper’ or ‘OpenScraperHTML’. The first works on xml content, the second on html. That is, the second will convert html to xml so that the xpath can be run on it. Both return an integer that needs to be used in subsequent calls. Basically, the integer replaces the filename as a reference. It allows you to have multiple files open and to have the system run multi-threaded and let it serve multiple people at the same time.

The scraping functions themselves are almost identical to the quick versions, except that they take an integer as first argument instead of a path. Other than that, usage is exactly the same, with the same types: one each for text, integers, doubles and dates.

Once you are done with the file, you have to call ‘CloseScraper’ with, as argument, the integer that was returned by ‘OpenScraper(HTML)’, so that resources can be cleaned up. This is important: if you forget to do this, the system will eventually buckle, crack and give up.
In a normal usage situation, you would do a short salvo: open a page, do a few scrapes and close it again, all in 1 block. This is not a requirement though: you can keep the page open across multiple inputs, as long as you maintain a reference to the scraper (the integer) somewhere in memory so that you don’t lose track of it.

Html scraping

As already mentioned, html scraping is done by first converting the page into xml before the xpath is executed. This conversion can cause some ‘changes’ in the structure of the file. In other words, the path that you would work out from the html file might not be correct for the xml version. This means that it is best to build your xpaths based on the xml version of the html pages.

The conversion routine that’s internally used by the chatbot designer is based on the SGMLReader library. This provides a command-line tool to manually convert html to xml files. This can be very useful for building the correct query. I’ve included a direct download for the command line html to xml conversion tool. Here’s a short description on how to use it (taken from the original documentation):

sgmlreader <options> [InputUri] [OutputFile]

-e “file” Specifies a file to write error output to. The default is to generate no errors. The special name “$stderr” redirects errors to stderr output stream.
-proxy “server” Specifies the proxy server to use to fetch DTD’s through the fire wall.
-html Specifies that the input is HTML.
-dtd “uri” Specifies some other SGML DTD.
-base Add an HTML base tag to the output.
-pretty Pretty print the output.
-encoding name Specify an encoding for the output file (default UTF-8)
-noxml Stops generation of XML declaration in output.
-doctype Copy <!DOCTYPE tag to the output.
InputUri The input file name or URL. Default is stdin. If this is a local file name then it also supports wildcards.
OutputFile The optional output file name. Default is stdout. If the InputUri contains wildcards then this just specifies the output file extension, the default being “.xml”.

Examples:

sgmlreader -html *.htm *.xml
Converts all .htm files to corresponding .xml files using the built in HTML DTD.

sgmlreader -html http://www.msn.com -proxy myproxy:80 msn.xml
Converts the MSN home page to XML, storing the result in the local file “msn.xml”.

sgmlreader -dtd ofx160.dtd test.ofx ofx.xml
Converts the given OFX file to XML using the SGML DTD “ofx160.dtd” specified in the test.ofx file.

Building an XPath

Once you have your xml file, getting the xpath to the element that you want can still be a little challenging. Html files simply aren’t designed with this type of usage in mind (and hey, if it can be easier for xml files, why not). Enter Firebug, an add-on for Firefox that allows developers to get a closer look at the html… or xml. After you have installed Firebug and loaded the xml file into Firefox, go to tools/Web developer/Firebug/Open firebug so that you can see the debug panel. In this panel, select the element that you wish to query, open the context menu and select ‘copy XPath’. And that’s it: simply paste this path into the chatbot designer and you’re done.
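If you want to double-check a copied path before pasting it into a pattern, a few lines of Python can run it against the converted xml. This is a rough sketch under a few assumptions: the path is a simple absolute one of the form ‘/html/body/div[2]/p[1]’ (element names with optional positions only), the converted file has no xml namespaces, and the sample document and the helper name `find_by_absolute_path` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a page that was already converted to xml.
CONVERTED = """<html><body>
<div><p>menu</p></div>
<div><p>Temperature: 21</p><p>Wind: 10 km/h</p></div>
</body></html>"""


def find_by_absolute_path(root, absolute_path):
    """Evaluate a simple absolute path such as '/html/body/div[2]/p[1]'."""
    steps = absolute_path.strip("/").split("/")
    # The first step names the root element itself; compare without any [n].
    if steps[0].split("[")[0] != root.tag:
        return []
    if len(steps) == 1:
        return [root]
    # ElementTree only accepts relative paths, so continue from the root.
    return root.findall("./" + "/".join(steps[1:]))


root = ET.fromstring(CONVERTED)
matches = find_by_absolute_path(root, "/html/body/div[2]/p[1]")
print([m.text for m in matches])
```

An empty result is a hint that the path was built against the html structure rather than the converted xml.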

 

A final pre-release video on how you can call .net functions from within your chatbot. The idea behind this feature is to allow you to extend your chatbot with custom features. This will only be available in the pro version though.

Note: the video is best viewed in max resolution and full screen to see all the details.

 

Check out this first ‘AI’ feature that can be done using only 1 rule and, if needed, some thesaurus lookups.

I’ve had a huge smile on my face all day.

For the interested, here’s a screenshot of the rule that enables this trick:

[Screenshot: the rule]

The important bit is the :complete after the variable $ToComp which performs the calculation.

Here’s another screencast that shows what’s happening behind the scenes (basically, it’s a walkthrough of the neural code in the designer):

 

Check out the new character, called ‘Mika’:

Pretty cool, eh? I think so as well. The character is another of Laticis Imagery’s creations. Ady provided all the images and I assembled them into a single character. The video demonstrates all the available expressions, which can be activated in the output using ‘mark’ ssml tags.

Perhaps some more information on the project: the first release, the basic version, should be ready in a short while now, once I have created some content (which should also be a perfect opportunity to work out some of the final details). The basic edition will be a free (as in beer) version. After that, the pro version will be prepared for release, which will contain some more functionality like user interface automation and/or home automation (not certain yet which to do first).

Some of the features that will be available in the first release:

  • Select if the bot starts the conversation or waits for some input on startup. Opening statements can be declared in the bot’s properties page.
  • You can declare custom memory operations that need to be performed each time the bot starts.
  • ‘Do patterns’ are also executed each time output is generated.
  • Input repetition is recognized (stored in memory as a counter) and can be handled with custom, conditional output patterns.
  • When no patterns matched, the system will use one of the custom fallback outputs.
  • Input patterns are grouped together into a single rule. These patterns share the same set of possible output patterns.
  • Multiple output patterns can be declared for a single rule. You can select if a random item needs to be selected from the list or if each item needs to be used in sequence (useful for story telling bots).
  • Each rule can have its own do patterns, which are used to manipulate the memory.
  • Rules are grouped together in topics (the 2 files that are imported in the video each represent a topic), which are responsible for providing context. This allows you to declare the same pattern in multiple topics (useful for short statements like ‘why, when, yes, no,…’).
  • Additional context can be added through do patterns and can be queried in conditions.
  • It’s possible to declare conditional questions at the level of a topic, meaning that multiple output patterns can share the same questions. The first one whose condition matches will be used for outputs that don’t declare their own question.
  • A single output pattern can link to other output patterns, indicating that it should be used if the rule it belongs to is the answer to a question declared in one of the linked outputs. This is useful to properly handle responses or when the user doesn’t respond as expected.
  • Time and date are supported in the output and conditionals through a variable. When used in combination with the thesaurus, some pretty powerful things can be done.
  • Test-cases for running automated tests on your bot.
  • Synonyms are automatically resolved in the input. This is a very powerful feature that’s able to recognize and replace compound words in the input. For instance, if an input pattern contains ‘what is’ and the system knows that the synonyms for ‘what is’ are ‘whats, what’s, wats, wat is, wat’s’, then you only need to declare 1 input pattern to recognize all of the possible synonyms.
  • Synonyms can be managed from the thesaurus editor.
  • The following operators can be used in the input patterns:
    • () group input together
    • [] option: words between the brackets are optional, not required to be present in the input
    • {} loop: words between the brackets can be found 0, 1 or more times (useful for lists)
    • | choice: the input needs to contain either the left part or the right part of the choice. This can be combined with an option, group or choice, like: [I | you | he | she | we | they]
    • $name: variable declaration: collects words that can be used in the output or conditions.
    • ^path: thesaurus variable declaration: the input needs to contain a word (or compound) that is a child of the specified thesaurus path (very powerful). The actual collected word can be used in the output/conditions like a regular variable.
    • && the and operator allows you to declare groups of words that need to be present in the input, but which can have ‘holes’ in between them, ex: (hello) && (what’s your name)
  • conditions and outputs can also use:
    • #path: declares a data-path into the memory.
    • ~name: to reference topics.
  • There is a built in topic-editor or you can edit them directly in xml format.
  • The built-in topic editor has a spell checker.
  • Patterns with errors have a red line, making them easy to find. The error text can be seen as a tooltip or in the log.
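As a rough illustration of the synonym feature from the list above, here is a small Python sketch that rewrites every known variant in the input to a single canonical form before any pattern matching happens. It is a guess at the general idea, not the designer’s actual algorithm: the `SYNONYMS` table and the `build_replacer` helper are made up for the example, and the table only holds the ‘what is’ entry from the list.

```python
import re

# Hypothetical thesaurus entry: canonical form -> known variants.
SYNONYMS = {
    "what is": ["whats", "what's", "wats", "wat is", "wat's"],
}


def build_replacer(synonyms):
    """Compile one regex that maps every variant back to its canonical form."""
    variant_to_canonical = {
        variant: canonical
        for canonical, variants in synonyms.items()
        for variant in variants
    }
    # Longest variants first so multi-word compounds win over shorter ones.
    ordered = sorted(variant_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )

    def replace(text):
        # Look up the matched variant in lower case, since matching ignores case.
        return pattern.sub(
            lambda m: variant_to_canonical[m.group(1).lower()], text
        )

    return replace


normalize = build_replacer(SYNONYMS)
print(normalize("Whats your name"))
print(normalize("wat is the time"))
```

With the input normalized like this, a single input pattern that contains ‘what is’ is enough. The word-boundary checks keep a word such as ‘whatsoever’ from being rewritten by accident.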
 

Note: Deprecated!

The next screen cast is ready. It demonstrates some of the features found in the ‘asset’ editor.

 

Note: Deprecated!

I’ve uploaded the first 2 in a set of screencasts on how to use NND. They can be reached from the Quick start page or you can view them directly from here:

Hope you liked them.

© 2012 Neural Network Design blog