Thursday, March 29, 2018

Newspaper module of Python



Python has an awesome module named newspaper for extracting and parsing newspaper articles. This article is a first taste of that module.

IMPORT THE NEWSPAPER MODULE

We assume newspaper is already installed as a Python module (in my case, I'm using Newspaper3k on Python 3). Now we import the module:

import newspaper

SET THE TARGET PAPER

For testing purposes, we want to look at articles published in the Technology section of the Guardian. The first step is to build the newspaper object, like so:

techno = newspaper.build('https://www.theguardian.com/uk/technology')

ARTICLE EXTRACTION

We just want to extract one recent article, using the following code:

target_article = techno.articles[0]
target_article.download()

The first line stores the first article in a variable called target_article. The second line downloads the article stored in that variable. 
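
Indexing straight into techno.articles raises an IndexError when the list is empty or shorter than expected. Here is a minimal sketch of a guard for that case. pick_article is a hypothetical helper name of mine, not part of newspaper, and plain strings stand in for the real Article objects so the snippet runs without a network connection:

```python
def pick_article(articles, index=0):
    """Return articles[index], or None if the list has no such position."""
    if 0 <= index < len(articles):
        return articles[index]
    return None


# Plain strings stand in for the Article objects in techno.articles.
sample_articles = ['first', 'second']
chosen = pick_article(sample_articles)
missing = pick_article([])
print(chosen, missing)
```

In the real script this would be called as pick_article(techno.articles). As far as I know, newspaper caches articles it has already seen by default, so a repeat run can leave the list empty; passing memoize_articles=False to newspaper.build() switches that cache off.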

Printing the result with print(target_article.html) just dumps the entire HTML to the console, which isn't very helpful. The brilliant thing about newspaper, though, is that it allows us to parse the article and then run some simple natural language processing against it.

PARSING ARTICLE

Now that we've downloaded the article, we're in a position to parse it:

target_article.parse()

This in turn enables us to target specific sections of the article, like the body text, title or author. Here's how to scrape body text:

print(target_article.text)

This will print only the body text to the console.
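
The other parsed sections can be read the same way. The sketch below gathers title, authors and text into one dictionary so they are easy to inspect together. article_fields is a hypothetical helper of mine, and the fake_article object is only a stand-in with the same attribute names, so the snippet runs without downloading anything:

```python
from types import SimpleNamespace


def article_fields(article, preview_chars=80):
    """Collect a few parsed fields from an article-like object into a dict."""
    text = article.text or ''
    return {
        'title': article.title,
        'authors': list(article.authors),
        'preview': text[:preview_chars],
        'length': len(text),
    }


# Stand-in for a parsed article (assumption: same attribute names as newspaper's Article).
fake_article = SimpleNamespace(
    title='Example headline',
    authors=['A. Writer'],
    text='First sentence of the body. Second sentence of the body.',
)
fields = article_fields(fake_article)
print(fields['title'], fields['authors'], fields['length'])
```

With a real article, the call would be article_fields(target_article), made after target_article.parse().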

WRITING THE BODY TEXT TO A FILE

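The body text isn't that helpful to us sitting there in the console output, so let's write the output of target_article.text to a file. No extra import is needed for this (import sys is not required); Python's built-in open() is enough, and a with block makes sure the file is closed once the text has been written:

with open('article.txt', 'w') as f:
    f.write(target_article.text)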
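
One detail the snippet above leaves to the platform default is the text encoding. News articles often contain curly quotes and accented letters, which can fail to write on systems whose default encoding is not UTF-8. Here is a small sketch that sets the encoding explicitly. save_text is a hypothetical helper name, sample_text is a stand-in for target_article.text, and the file goes to the system temp directory just for the demonstration:

```python
import os
import tempfile


def save_text(text, path):
    """Write text to path as UTF-8 and return the number of characters written."""
    with open(path, 'w', encoding='utf-8') as f:
        return f.write(text)


# Stand-in for target_article.text, with curly quotes and a non-ASCII letter.
sample_text = 'The \u201csmart\u201d caf\u00e9 of the future'
out_path = os.path.join(tempfile.gettempdir(), 'article.txt')
written = save_text(sample_text, out_path)
print(written, 'characters written to', out_path)
```

In the real script the call would be save_text(target_article.text, 'article.txt').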
