Sunday, June 2, 2013

Scraping web pages and building menus

Introduction

This plugin will display the news from Canal Algerie, located at the following page: http://www.entv.dz/tvfr/video/index.php. The last six broadcasts are available in two languages, so our menu will list 12 videos, each with a thumbnail, date and language. This site was chosen because the generated HTML is very regular and easy to parse, which makes it a good example.

Getting Started

We start as we start every plugin, by creating a directory for our plugin, in this case 'plugin.video.canalalgerie', then creating the icon file icon.png, a blank default.py and a new addon.xml, possibly copied from a previous project and modified accordingly:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<addon id="plugin.video.canalalgerie" name="Canal Algerie"
 version="1.0.0" provider-name="Abed BenBrahim">
 <requires>
  <import addon="xbmc.python" version="2.1" />
 </requires>
 <extension point="xbmc.python.pluginsource" library="default.py">
  <provides>video</provides>
 </extension>
 <extension point="xbmc.addon.metadata">
  <summary lang="en">Canal Algerie Journal Télévisé</summary>
  <description lang="en">Canal Algerie Journal Télévisé
  </description>
  <platform>all</platform>
  <language>en</language>
  <email>abenbrahim@comcast.net</email>
 </extension>
</addon>

Next we will look at the source code of the web page to be scraped (http://www.entv.dz/tvfr/video/index.php). The section of interest is shown below (reformatted for clarity):

<div style="display: none;" id="tab5" class="tab_content">
 <table width="100%" border="0">
  <tr>
   <td align="left"><a href="index.php?t=JT20H_01-06-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Samedi 01 juin</td>
   <td align="left"><a href="index.php?t=JT20H_31-05-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Vendredi 31 mai</td>
   <td align="left"><a href="index.php?t=JT20H_30-05-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Jeudi 30 mai</td>
  </tr>
  <tr>
   <td align="left"><a href="index.php?t=JT20H_29-05-2013"
    target="_self"> <img src="images/JT20H_29-05-2013.jpg"
     width="90" height="60" /> </a>Mercredi 29 mai</td>
   <td align="left"><a href="index.php?t=JT20H_28-05-2013"
    target="_self"> <img src="images/JT20H_28-05-2013.jpg"
     width="90" height="60" /> </a>Mardi 28 mai</td>
   <td align="left"><a href="index.php?t=JT20H_27-05-2013"
    target="_self"> <img src="images/JT20H_27-05-2013.jpg"
     width="90" height="60" /> </a>Lundi 27 mai</td>
  </tr>
 </table>
</div>
<div style="display: none;" id="tab6" class="tab_content">
 <table width="100%" border="0">
  <tr>
   <td align="left"><a href="index.php?t=JT19H_01-06-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Samedi 01 juin</td>
   <td align="left"><a href="index.php?t=JT19H_31-05-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Vendredi 31 mai</td>
   <td align="left"><a href="index.php?t=JT19H_30-05-2013"
    target="_self"> <img src="../images/img_vid.jpg" width="90"
     height="60" /> </a>Jeudi 30 mai</td>
  </tr>
  <tr>
   <td align="left"><a href="index.php?t=JT19H_29-05-2013"
    target="_self"> <img src="images/JT19H_29-05-2013.jpg"
     width="90" height="60" /> </a>Mercredi 29 mai</td>
   <td align="left"><a href="index.php?t=JT19H_28-05-2013"
    target="_self"> <img src="images/JT19H_28-05-2013.jpg"
     width="90" height="60" /> </a>Mardi 28 mai</td>
   <td align="left"><a href="index.php?t=JT19H_27-05-2013"
    target="_self"> <img src="images/JT19H_27-05-2013.jpg"
     width="90" height="60" /> </a>Lundi 27 mai</td>
  </tr>
 </table>
</div>

We notice that every news program is listed between <td align="left"> and </td> tags, and there are no other td tags on the page. Between the td tags, we find a link to a page containing the video, a thumbnail image and the date. We will infer the language from the link: if it contains 19H, the broadcast is in French; if it contains 20H, it is in Arabic.
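The language rule can be sketched as a small function. The name inferLanguage is hypothetical and is not part of the plugin; the two links are taken from the HTML shown above.

```python
def inferLanguage(link):
    """Guess the broadcast language from a video page link (hypothetical helper)."""
    if '19H' in link:
        return 'French'
    if '20H' in link:
        return 'Arabic'
    return None  # neither marker is present in the link


print(inferLanguage('index.php?t=JT19H_01-06-2013'))
print(inferLanguage('index.php?t=JT20H_01-06-2013'))
```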

Developing a Strategy

When our plugin starts with no parameters, we will access the website page and scrape it to build a menu containing the 12 items. Each item will contain a link back to our plugin, with a parameter named 'play' and additional parameters specifying the video information. When our plugin is accessed with the play parameter present, we will load the video page, scrape it to extract the video link, and play the video as demonstrated in the previous post.

Handling Parameters

Clearly, we first need to parse the parameters passed to our plugin. We will add a function called parseParameters to our util.py, since we will need it for most of our future plugins:
import sys
import urllib

def parseParameters(inputString=sys.argv[2]):
    """Parses a parameter string starting at the first ? found in inputString
    
    Argument:
    inputString: the string to be parsed, sys.argv[2] by default
    
    Returns a dictionary with parameter names as keys and parameter values as values
    """
    parameters = {}
    p1 = inputString.find('?')
    if p1 >= 0:
        splitParameters = inputString[p1 + 1:].split('&')
        for nameValuePair in splitParameters:
            if len(nameValuePair) > 0:
                # split on the first '=' only
                pair = nameValuePair.split('=', 1)
                key = pair[0]
                # a parameter without '=' gets an empty value instead of raising IndexError
                rawValue = pair[1] if len(pair) > 1 else ''
                value = urllib.unquote(urllib.unquote_plus(rawValue)).decode('utf-8')
                parameters[key] = value
    return parameters
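For comparison, here is a minimal sketch of the same idea using the standard library's parse_qsl. It is not the code used by the plugin: parseQuery is a hypothetical name, and the sample parameter string is made up for illustration.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2


def parseQuery(inputString):
    """Return the parameters after the first '?' as a dictionary (hypothetical helper)."""
    queryStart = inputString.find('?')
    if queryStart < 0:
        return {}
    # keep_blank_values=True keeps parameters such as 'play=' that have an empty value
    return dict(parse_qsl(inputString[queryStart + 1:], keep_blank_values=True))


print(parseQuery('?play=1&title=Samedi+01+juin'))
print(parseQuery('?'))
```

A bare '?' yields an empty dictionary, which is the case where the plugin builds its menu.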
Our default.py, with playVideo and buildMenu stubbed out, now looks like this:
import util

def playVideo(params):
    pass

def buildMenu():
    pass

parameters = util.parseParameters()
if 'play' in parameters:
    playVideo(parameters)
else:
    buildMenu()

Accessing a web page

To build the menu, we need to access a web page and parse it. We will use the urllib2 library to access a URL (see Documentation and References below). This is the best choice if we only need to access one URL per invocation of the plugin. If we need to access multiple URLs on the same site, there are better choices, which will be documented in future posts.
The code to open a URL and get the content is shown below:
import util, urllib2

def playVideo(params):
    pass

def buildMenu():
    url = WEB_PAGE_BASE + 'index.php'
    response = urllib2.urlopen(url)
    if response and response.getcode() == 200:
        content = response.read()
        print content
    else:
        util.showError(ADDON_ID, 'Could not open URL %s to create menu' % (url))


WEB_PAGE_BASE = 'http://www.entv.dz/tvfr/video/'
ADDON_ID = 'plugin.video.canalalgerie'

parameters = util.parseParameters()
if 'play' in parameters:
    playVideo(parameters)
else:
    buildMenu()

Note that we print the content (line 11), so that we can start debugging the plugin and ensure that we are on the right track. We should remove the print statements once we finish development; otherwise whatever is printed will be added to the xbmc log at log level NOTICE.


    
This code includes error handling in case the web page cannot be accessed. It relies on two new procedures defined in util.py: notify, which displays a notification to the user, and showError, which displays the error and logs it:

def notify(addonId, message, timeShown=5000):
    """Displays a notification to the user
    
    Parameters:
    addonId: the current addon id
    message: the message to be shown
    timeShown: the length of time for which the notification will be shown, in milliseconds, 5 seconds by default
    """
    addon = xbmcaddon.Addon(addonId)
    xbmc.executebuiltin('Notification(%s, %s, %d, %s)' % (addon.getAddonInfo('name'), message, timeShown, addon.getAddonInfo('icon')))
    
def showError(addonId, errorMessage):
    """
    Shows an error to the user and logs it
    
    Parameters:
    addonId: the current addon id
    errorMessage: the message to be shown and logged
    """
    notify(addonId, errorMessage)
    xbmc.log(errorMessage, xbmc.LOGERROR)
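
One caveat about the status check in buildMenu: urlopen raises an exception (URLError, or its subclass HTTPError for responses such as 404) rather than returning a response when the request fails, so the else branch alone does not cover every failure. A minimal sketch of a wrapper that handles both cases follows; fetchPage and its timeout value are assumptions for illustration, not part of the original util.py.

```python
try:
    from urllib.request import urlopen      # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import urlopen, URLError   # Python 2, as used in this post


def fetchPage(url, timeout=10):
    """Return the page content, or None if it cannot be fetched (hypothetical helper)."""
    try:
        response = urlopen(url, timeout=timeout)
        if response.getcode() == 200:
            return response.read()
    except URLError:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too
        pass
    return None


# An unsupported scheme makes urlopen raise URLError without any network access
print(fetchPage('nosuchscheme://example'))
```

With such a helper, buildMenu would call util.showError whenever fetchPage returns None.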


Testing the add-on in the IDE

In order to test the plugin in the IDE, we need to add the xbmc stubs to the Python path, as described in a previous post. We also need to define two of the three arguments the plugin receives:
  • the plugin URI, in this case plugin://plugin.video.canalalgerie. You cannot set this one, as it will be set to the path of your Python file when you run in the IDE.
  • a plugin id (sys.argv[1]); any integer will do, we use 0
  • the parameter string (sys.argv[2]); even if there are no parameters, it will contain at least a question mark
So for example, to test the plugin's buildMenu procedure, you would set the program arguments to '0 ?'.
Once this is done, we should be able to run the plugin and, if everything went well, we should see the source for the web page in the IDE's console window.
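If setting the arguments in the IDE's run configuration is inconvenient, they can also be filled in from a small test script. The sketch below is an assumption about how one might do it, and simulateXbmcArguments is a hypothetical name. Because the default value of parseParameters is read from sys.argv[2] when util is imported, the arguments must be in place before the import.

```python
import sys


def simulateXbmcArguments(parameterString='?', handle=0):
    """Fill sys.argv the way XBMC would (hypothetical helper, for IDE runs only)."""
    # sys.argv[0] is left alone: outside XBMC it is the path of the script
    sys.argv = sys.argv[:1] + [str(handle), parameterString]


simulateXbmcArguments('?')
print(sys.argv[1:])
# import util and run the plugin code only after this point
```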

Parsing the page

There are many libraries that handle the problem of parsing HTML, which may not be well formed. Libraries such as BeautifulSoup and ElementTree are very powerful, but when we do not need the power and flexibility they offer, something we write ourselves will be faster and less resource intensive. We therefore define extractAll in util.py, a function that finds all occurrences of text between a starting token and an ending token and returns an array of all such occurrences:
def extractAll(text, startText, endText):
    """
    Extract all occurrences of string within text that start with startText and end with endText
    
    Parameters:
    text: the text to be parsed
    startText: the starting token
    endText: the ending token
    
    Returns an array containing all occurrences found, with tabs and newlines removed and leading whitespace removed
    """
    result = []
    start = 0
    pos = text.find(startText, start)
    while pos != -1:
        start = pos + len(startText)
        end = text.find(endText, start)
        if end == -1:
            # no ending token: stop rather than append a truncated entry
            break
        result.append(text[start:end].replace('\n', '').replace('\t', '').lstrip())
        pos = text.find(startText, end)
    return result
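
To see what extractAll returns, here is a self-contained run on a shortened, made-up fragment in the style of the page. The function is repeated in compact form only so that the snippet runs on its own.

```python
def extractAll(text, startText, endText):
    """Compact copy of util.extractAll so that this snippet runs on its own."""
    result = []
    pos = text.find(startText)
    while pos != -1:
        start = pos + len(startText)
        end = text.find(endText, start)
        if end == -1:
            break  # no ending token: stop instead of returning a truncated entry
        result.append(text[start:end].replace('\n', '').replace('\t', '').lstrip())
        pos = text.find(startText, end)
    return result


# Shortened sample in the style of the scraped page (not the real page content)
SAMPLE = ('<tr><td align="left"><a href="index.php?t=JT20H_01-06-2013">\n'
          '\t<img src="../images/img_vid.jpg" /> </a>Samedi 01 juin</td>'
          '<td align="left"><a href="index.php?t=JT20H_31-05-2013">\n'
          '\t<img src="../images/img_vid.jpg" /> </a>Vendredi 31 mai</td></tr>')

for cell in extractAll(SAMPLE, '<td align="left">', '/td>'):
    print(cell)
```

Each extracted cell keeps a trailing '<', because the ending token is '/td>' rather than '</td>'. That trailing character is what later delimits the date.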

Our buildMenu procedure in default.py now looks like this:
def buildMenu():
    url = WEB_PAGE_BASE + 'index.php'
    response = urllib2.urlopen(url)
    if response and response.getcode() == 200:
        content = response.read()
        videos=util.extractAll(content, '<td align="left">', '/td>')
        for video in videos:
            print video
    else:
        util.showError(ADDON_ID, 'Could not open URL %s to create menu' % (url))

When we run the plugin, we should now see 12 lines containing the extracted text.
We are now ready to extract the video title, the thumbnail image link and the video page link, the last step before displaying the menu. To extract a single substring, we define the extract function in util.py:
def extract(text, startText, endText):
    """
    Extract the first occurrence of a string within text that starts with startText and ends with endText
    
    Parameters:
    text: the text to be parsed
    startText: the starting token
    endText: the ending token
    
    Returns the string found between startText and endText, or None if the startText or endText is not found
    """
    start = text.find(startText, 0)
    if start != -1:
        start = start + len(startText)
        # search from start (not start+1) so that an empty value is also found
        end = text.find(endText, start)
        if end != -1:
            return text[start:end]
    return None
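
As a quick check of extract, the sketch below applies it to one cell of the kind returned by extractAll. The cell is sample data modelled on the page source shown earlier, and the function is repeated in compact form only so that the snippet runs on its own.

```python
def extract(text, startText, endText):
    """Compact copy of util.extract so that this snippet runs on its own."""
    start = text.find(startText)
    if start == -1:
        return None
    start += len(startText)
    end = text.find(endText, start)
    if end == -1:
        return None
    return text[start:end]


# One cell as returned by extractAll (sample data in the style of the page)
CELL = ('<a href="index.php?t=JT19H_29-05-2013" target="_self"> '
        '<img src="images/JT19H_29-05-2013.jpg" width="90" height="60" /> '
        '</a>Mercredi 29 mai<')

print(extract(CELL, 'a href="', '"'))       # video page link
print(extract(CELL, 'img src="', '"'))      # thumbnail link
print(extract(CELL, '</a>', '<'))           # date
print(extract(CELL, 'no such token', '<'))  # None: starting token not found
```

Since extract can return None, concatenating its result directly to WEB_PAGE_BASE would raise a TypeError if the page layout ever changed.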
The three pieces of information are extracted in our buildMenu procedure:
def buildMenu():
    url = WEB_PAGE_BASE + 'index.php'
    response = urllib2.urlopen(url)
    if response and response.getcode() == 200:
        content = response.read()
        videos=util.extractAll(content, '<td align="left">', '/td>')
        for video in videos:
            params={'play':1}
            params['video']=WEB_PAGE_BASE+util.extract(video, 'a href="', '\"')
            params['image']=WEB_PAGE_BASE+util.extract(video,'img src="','\"')
            params['title']=util.extract(video,'</a>','<')+' (%s)'%('Français' if '19H' in params['video'] else 'Arabe')
            print 'Title: ',params['title'],'\tVideo page:', params['video'],'\tImage: ', params['image']
    else:
        util.showError(ADDON_ID, 'Could not open URL %s to create menu' % (url))


Note that we store the extracted values in a dictionary, as this will be handy when we build a link back to the plugin to play the video. Also note that in order to include non-ASCII characters (the c cedilla in Français), we need to use Unicode. This is not a problem, because the parseParameters and makeLink functions in util.py (makeLink is described in the next section) fully support Unicode, unlike a lot of routines you will find in other existing plugins.

Building the menu

In order to build the menu, we define three helper functions in util.py: makeLink to create links back to our plugin, addMenuItem to add a menu item to the xbmc menu, and endListing to signal the end of the menu.

def makeLink(params, baseUrl=sys.argv[0]):
    """
    Build a link with the specified base URL and parameters
    
    Parameters:
    params: the params to be added to the URL
    baseUrl: the base URL, sys.argv[0] by default
    """
    return baseUrl + '?' +urllib.urlencode(dict([k.encode('utf-8'),unicode(v).encode('utf-8')] for k,v in params.items()))


def addMenuItem(caption, link, icon=None, thumbnail=None, folder=False):
    """
    Add a menu item to the xbmc GUI
    
    Parameters:
    caption: the caption for the menu item
    icon: the icon for the menu item, displayed if the thumbnail is not accessible
    thumbnail: the thumbnail for the menu item
    link: the link for the menu item
    folder: True if the menu item is a folder, false if it is a terminal menu item
    
    Returns True if the item is successfully added, False otherwise
    """
    listItem = xbmcgui.ListItem(unicode(caption), iconImage=icon, thumbnailImage=thumbnail)
    listItem.setInfo(type="Video", infoLabels={ "Title": caption })
    return xbmcplugin.addDirectoryItem(handle=int(sys.argv[1]), url=link, listitem=listItem, isFolder=folder)

def endListing():
    """
    Signals the end of the menu listing
    """
    xbmcplugin.endOfDirectory(int(sys.argv[1]))
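
The round trip from a parameter dictionary to a link and back can be tried outside XBMC. The sketch below is written for Python 3, where urlencode encodes non-ASCII text as UTF-8 by itself; it illustrates the kind of link makeLink builds and is not the plugin's own code. The base URL is the plugin URI mentioned in the testing section.

```python
from urllib.parse import urlencode, parse_qsl

BASE_URL = 'plugin://plugin.video.canalalgerie'

params = {'play': 1,
          'title': 'Samedi 01 juin (Français)',
          'video': 'http://www.entv.dz/tvfr/video/index.php?t=JT19H_01-06-2013'}

# Build the link: every value is converted to text and percent-encoded
link = BASE_URL + '?' + urlencode(params)

# Parse it back: everything after the first '?' is the parameter string
decoded = dict(parse_qsl(link.split('?', 1)[1]))

print(link)
print(decoded['title'])
```

Note that all values come back as strings, so the integer 1 stored under 'play' is recovered as '1'.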
The buildMenu procedure now looks like this:
def buildMenu():
    url = WEB_PAGE_BASE + 'index.php'
    response = urllib2.urlopen(url)
    if response and response.getcode() == 200:
        content = response.read()
        videos = util.extractAll(content, '<td align="left">', '/td>')
        for video in videos:
            params = {'play':1}
            params['video'] = WEB_PAGE_BASE + util.extract(video, 'a href="', '\"')
            params['image'] = WEB_PAGE_BASE + util.extract(video, 'img src="', '\"')
            params['title'] = util.extract(video, '</a>', '<') + ' (%s)' % ('Français' if '19H' in params['video'] else 'Arabe')
            link = util.makeLink(params)
            util.addMenuItem(params['title'], link, 'DefaultVideo.png', params['image'], False)
        util.endListing()
    else:
        util.showError(ADDON_ID, 'Could not open URL %s to create menu' % (url))

This yields a menu listing the 12 videos, each with its thumbnail, date and language.

Playing the video

The code for the playVideo procedure is shown below without comment, since all concepts used have already been discussed:
def playVideo(params):
    response = urllib2.urlopen(params['video'])
    if response and response.getcode() == 200:
        content = response.read()
        videoLink = util.extract(content, 'flashvars.File = "', '"')
        util.playMedia(params['title'], params['image'], videoLink, 'Video')
    else:
        util.showError(ADDON_ID, 'Could not open URL %s to get video information' % (params['video']))



The plugin including full source code can be downloaded here.

Documentation and References


6 comments :

  1. Hi,
    I find your article very informative. Thanks very much. I ran the code on eclipse with the stubs 'https://github.com/Tenzer/xbmcstubs' and met the problem:
    'def parseParameters(inputString=sys.argv[2]):
    IndexError: list index out of range'
    It said the array argv has no element argv[2]. Is there anyway I can solve the problem.

    Regards

    Replies
    1. 0 ?
      are the parameters. See the paragraph 'Testing the add-on in the IDE'.

    2. Thanks for your reply. It works

  2. This comment has been removed by the author.

  3. Hello. Thank you for the great article and example code. Using the information here, I fixed my first plugin so that it works perfectly and no longer prints errors to the XBMC log file when changing items.

    I have a question. Could you write me a short e-mail, so that I can reply with the URL of my second XBMC plugin, which does not open RTMP streams? The obtained data, with all URLs and tokens, is also printed to the log file as an rtmpdump.exe command line, and when I execute that command, the stream is dumped fine.

    I also asked on the XBMC forum and am waiting for new information by PM from one user. He looked at my code (which I want to keep confidential for the moment) and does not see any errors right now, but the RTMP streams do not open anyway. My e-mail address is: oles[remove-th@t]o2.pl

    Thanks in advance and best regards, olesio :)

  4. Good work... I'm seeking help to play my videos. I upload them to openload but I can get a link that work in Kodi, maybe I need to get a script or add an extension to the link, I don't know. Can someone help?
