How to Scrape RSS News Feeds from Google with Python (Full Code)

One critical factor when trading stocks is keeping up with the news. Going through news outlets and constantly sorting out what is noise versus what has a material impact is difficult if you’re doing it manually.

A great way to streamline this process is to scrape news headlines from Google and build a funnel to filter down the results. Here’s some code I’ve built to help you easily scrape news from Google.

How the Code Works

  • The code takes a list of keywords you want to search for
  • For each keyword search, it returns the headline, the link to the article, a Python datetime object for when the article was published, the source of the article, and your search term

Python Code

Install the required Python libraries

pip install pytz
pip install feedparser

Python function

import feedparser
import pytz
from datetime import datetime

def google_rss(keywords):
    # Google RSS News
    def parse_rss(rss_url):
        return feedparser.parse(rss_url)

    def get_headlines(rss_url):
        headlines = []

        feed = parse_rss(rss_url)
        for item in feed['items']:
            headlines.append(item['title'])

        return headlines

    def get_links(rss_url):
        links = []

        feed = parse_rss(rss_url)
        for item in feed['items']:
            links.append(item['link'])

        return links

    def get_pub_dates(rss_url):
        pub_dates = []

        feed = parse_rss(rss_url)
        for item in feed['items']:
            pub_dates.append(item['published'])

        return pub_dates

    def get_sources(rss_url):
        sources = []

        feed = parse_rss(rss_url)
        for item in feed['items']:
            sources.append(item['source']['title'])

        return sources

    headline = []
    link = []
    pub_date = []
    source = []
    rss_source = []
    keyword = []

    for x in keywords:
        # Build the Google News RSS search URL for this keyword
        url = 'https://news.google.com/rss/search?q=' + x + '+news'

        for a in get_headlines(url):
            headline.append(a)
            rss_source.append('google')
            keyword.append(x)

        for b in get_links(url):
            link.append(b)

        for c in get_pub_dates(url):
            # Published dates arrive in GMT; strip the trailing ' GMT',
            # parse the timestamp, and convert it to US Eastern time
            gmt = pytz.timezone('GMT')
            eastern = pytz.timezone('US/Eastern')
            date = datetime.strptime(c[:-4], '%a, %d %b %Y %H:%M:%S')
            date_gmt = gmt.localize(date)
            date_eastern = date_gmt.astimezone(eastern)

            pub_date.append(date_eastern)

        for d in get_sources(url):
            source.append(d)

    # Zip the parallel lists into (headline, link, pub_date, source, keyword, rss_source)
    # tuples once every keyword has been processed
    rss_list = list(zip(headline, link, pub_date, source, keyword, rss_source))
    return rss_list
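
Here’s a minimal usage sketch showing how you might call the function and loop over the results. The keywords below are placeholders; substitute whatever tickers or search terms you care about.

# Example usage (the keywords here are placeholders, use your own terms)
results = google_rss(['AAPL', 'TSLA'])

for headline, link, pub_date, source, keyword, rss_source in results:
    print(pub_date, source, keyword)
    print(headline)
    print(link)
    print()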

Improvement Ideas

Here are my thoughts on where to go next to improve this project.

  • Automatically populate a keyword list based on the symbols in my portfolio holdings
  • Correlate news with price movement to determine whether the news had an impact on the corresponding stock price
  • Filter out duplicate headlines from multiple sources (see the sketch after this list)
  • Filter out news that had no notable impact on the stock price
  • Populate a weekly summary of the most notable articles, based on 1) the impact they had on the market or the stock price or 2) the popularity of the article
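
As a starting point for the duplicate-headline idea, here is a rough sketch that drops repeated headlines after normalizing case and whitespace. The dedupe_headlines helper is hypothetical and assumes the tuple layout returned by google_rss above, not part of the code in this post.

def dedupe_headlines(rss_list):
    # Keep only the first occurrence of each headline,
    # comparing case-insensitively with whitespace collapsed
    seen = set()
    unique = []
    for row in rss_list:
        key = ' '.join(row[0].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique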

Stay tuned to the blog for more updates to this project.
