How to fetch Google News from RSS and then filter with source 2019

The output of Google News RSS is in XML format. Here is a sample Google News RSS XML output:

<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        <generator>NFE/5.0</generator>
        <title>"Apple" - Google News</title>
        <link>
            https://news.google.com/search?q=apple
        </link>
        <language>en-IN</language>
        <webMaster>news-webmaster@google.com</webMaster>
        <copyright>2019 Google Inc.</copyright>
        <lastBuildDate>xxxxxxxxxxxxx</lastBuildDate>
        <description>Google News</description>
        <item>
            <title>
                HomePod available in China starting Friday, January 18 - Apple Newsroom
            </title>
            <link>
                https://www.apple.com/newsroom/2019/01/HomePod-available-in-china-starting-friday-january-18/
            </link>
            <pubDate>Sun, 13 Jan 2019 21:57:48 GMT</pubDate>
            <description>
                <a href="https://www.apple.com/newsroom/2019/01/HomePod-available-in-china-starting-friday-january-18/" target="_blank">HomePod available in China starting Friday, January 18</a>&nbsp;&nbsp;<font color="#6f6f6f">Apple Newsroom</font><p>HomePod, the innovative wireless speaker from Apple, will be available in mainland China and Hong Kong markets starting Friday, January 18.</p>
            </description>
            <source url="https://www.apple.com">Apple Newsroom</source>
            <media:content url="https://lh5.googleusercontent.com/proxy/3sb76nYiUcZoYNn3vMBTrH0dbNTM0r73U5lBdJdHlU10Y1o-8HfGmUBJhogpIrdmr4YybfRtSHUb7pdrbrIHmnT48bn-KzHiuNpha_GnkjyokluuT0WMbxZSn5oNO_Znmz550OL4XZAuEzfRx_Ai3KR11avjFAf9sNM6eLccqsXxMrniTtF4zvtcfso2n6MGO7pzbWM=-w150-h150-c" medium="image" width="150" height="150"/>
        </item>
    </channel>
</rss>

There are two methods to parse the XML. But first, let’s look at the different types of XML nodes and how to extract data from them. By the end of the article you’ll see why I’ve listed two methods. Here are the node types that need special handling:

  1. Tags with hyphen(-)
     <item-name>

    In such a case you have to use $news->{'item-name'}

  2. Tags with colon(:)
     <item:name>

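    Data from such tags can only be retrieved by knowing the namespace of the XML.
    What is an XML namespace? It binds a prefix to a URI through an xmlns:prefix attribute. In the sample above, xmlns:media="http://search.yahoo.com/mrss/" on the <rss> tag declares the media prefix used by <media:content>.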
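
To make both cases concrete, here is a minimal sketch using SimpleXML. The <feed> document and its tag names are made up for illustration and are not part of the Google News feed; only the media namespace URI is taken from the sample above.

```php
<?php
// Hypothetical sample document covering both node types.
$xml = <<<'XML'
<feed xmlns:media="http://search.yahoo.com/mrss/">
    <item-name>Hyphenated tag</item-name>
    <media:content url="https://example.com/image.jpg" medium="image"/>
</feed>
XML;

$feed = simplexml_load_string($xml);

// 1. Tag with a hyphen: use the curly-brace property syntax.
$itemName = (string) $feed->{'item-name'};

// 2. Tag with a colon: look up the namespace URI by its prefix,
//    then ask for the children that belong to that namespace.
$namespaces = $feed->getNamespaces(true);
$media = $feed->children($namespaces['media']);

// The url attribute has no prefix, so it is read through attributes().
$imageUrl = (string) $media->content->attributes()->url;

echo $itemName, PHP_EOL;
echo $imageUrl, PHP_EOL;
```

The namespace lookup is the extra step that the first method below avoids by rewriting the tag names.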

Here are the two methods for fetching Google News, which you can then show on your website or save in your database:

  1. Through file_get_contents and then manipulating the XML string (Recommended)
    public function getNewsFromGoogle($query) {
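        $newsXml = file_get_contents('https://news.google.com/rss/search?q=' . urlencode($query));
        if ($newsXml === false) {
            return []; // the request failed
        }
        $newsXml = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $newsXml); // this will convert <media:content> to <mediacontent>
        $newsXml = simplexml_load_string($newsXml);
        if ($newsXml === false) {
            return []; // the response was not valid XML
        }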
        $news = [];
    
        foreach ($newsXml->channel->item as $item) {
            $details = [];
            $title = (string) $item->title;
    
            if ($title == "This RSS feed URL is deprecated") {
                continue;
            } else {
                $details['title'] = trim($title);
                $details['description'] = trim(strip_tags((string) $item->description));
                $published_date = (string) $item->pubDate;
                $published_date = date('Y-m-d H:i:s', strtotime($published_date));
                $details['published_date'] = $published_date;
                $details['url'] = trim((string) $item->link); // cast, otherwise a SimpleXMLElement object is stored
    
                if (isset($item->mediacontent)) {
                    $details['image'] = (string) $item->mediacontent["url"];
                } else {
                    $details['image'] = null;
                }
    
                array_push($news, $details);
            }
        }
    
        return $news;
    }
  2. Directly hitting the API with simplexml_load_file
    public function getNewsFromGoogle($query) {
        $newsXml = simplexml_load_file('https://news.google.com/rss/search?q=' . urlencode($query));
        if ($newsXml === false) {
            return []; // the request failed or the response was not valid XML
        }
        $ns = $newsXml->getNamespaces(true); // use only if there are tags such as <media:content> i.e. with colon(:)
        $news = [];
    
        foreach ($newsXml->channel->item as $item) {
            $details = [];
            $media = $item->children($ns["media"]); // media is the namespace, refer the XML sample above
            $title = (string) $item->title;
    
            if ($title == "This RSS feed URL is deprecated") {
                continue;
            } else {
                $details['title'] = trim($title);
                $details['description'] = trim(strip_tags((string) $item->description));
                $published_date = (string) $item->pubDate;
                $published_date = date('Y-m-d H:i:s', strtotime($published_date));
                $details['published_date'] = $published_date;
                $details['url'] = trim((string) $item->link); // cast, otherwise a SimpleXMLElement object is stored
    
                if (isset($media->content)) {
                    // the url attribute has no prefix, so it is not in the media namespace; read it through attributes()
                    $details['image'] = (string) $media->content->attributes()->url;
                } else {
                    $details['image'] = null;
                }
    
                array_push($news, $details);
            }
        }
    
        return $news;
    }
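
Each <item> in the sample XML also carries a <source url="..."> tag, so the results can be narrowed down by publisher after fetching as well. The following is only a sketch of that idea: filterItemsBySource() is a hypothetical helper, the second item in the sample is made up, and I am assuming the host in the url attribute is a good enough way to identify a source.

```php
<?php
// Hypothetical helper: keeps only the items whose <source url="..."> host is allowed.
function filterItemsBySource(SimpleXMLElement $channel, array $allowedHosts): array
{
    $kept = [];

    foreach ($channel->item as $item) {
        if (!isset($item->source['url'])) {
            continue; // no <source> tag on this item
        }

        $host = parse_url((string) $item->source['url'], PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // the url attribute could not be parsed
        }

        // Compare hosts without a leading "www." and in lower case.
        $host = preg_replace('/^www\./', '', strtolower($host));
        if (in_array($host, $allowedHosts, true)) {
            $kept[] = [
                'title'  => trim((string) $item->title),
                'source' => trim((string) $item->source),
            ];
        }
    }

    return $kept;
}

// Cut-down feed: the first item comes from the sample above, the second is invented.
$sample = <<<'XML'
<rss version="2.0">
    <channel>
        <item>
            <title>HomePod available in China starting Friday, January 18 - Apple Newsroom</title>
            <source url="https://www.apple.com">Apple Newsroom</source>
        </item>
        <item>
            <title>Made-up headline - Example Times</title>
            <source url="https://news.example.org">Example Times</source>
        </item>
    </channel>
</rss>
XML;

$rss = simplexml_load_string($sample);
$filtered = filterItemsBySource($rss->channel, ['apple.com']);
print_r($filtered);
```

Filtering this way happens on your side after the feed is downloaded, so it only sees the items Google already returned. The search operators in the note below narrow the results before they reach you.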

Note:

  1. While searching you can apply filters (https://support.google.com/websearch/answer/2466433):
    • For an exact match, wrap the query in double quotes (""), so the URL becomes:
      'https://news.google.com/rss/search?q="' . urlencode($query) . '"'
    • To filter by source website, use the 'site' operator:
      'https://news.google.com/rss/search?q="site:hindustantimes.com%20' . urlencode($query) . '"'
    • To filter by multiple source websites:
      'https://news.google.com/rss/search?q="site:hindustantimes.com%20OR%20site:nytimes.com%20' . urlencode($query) . '"'
    • "%20" stands for a space; if you don’t put a space between the filters, you won’t get any search results.
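
The filters above can also be assembled in one place instead of by hand. Here is a rough sketch: buildNewsSearchUrl() and its parameters are hypothetical names, and I am assuming that only the search phrase is quoted and that the whole query string is percent-encoded with rawurlencode(), which turns a space into %20 and a double quote into %22.

```php
<?php
// Hypothetical helper that assembles the search URL from a query and optional filters.
function buildNewsSearchUrl(string $query, array $sites = [], bool $exactMatch = false): string
{
    // Exact match: wrap the phrase in double quotes.
    $terms = $exactMatch ? '"' . $query . '"' : $query;

    if ($sites !== []) {
        // One site: filter per source website, joined with OR.
        $siteFilters = array_map(
            static function (string $site): string {
                return 'site:' . $site;
            },
            $sites
        );
        // The space between the filters and the query is required.
        $terms = implode(' OR ', $siteFilters) . ' ' . $terms;
    }

    return 'https://news.google.com/rss/search?q=' . rawurlencode($terms);
}

$plainUrl = buildNewsSearchUrl('apple');
$filteredUrl = buildNewsSearchUrl('apple', ['hindustantimes.com', 'nytimes.com'], true);

echo $plainUrl, PHP_EOL;
echo $filteredUrl, PHP_EOL;
```

The resulting string can be passed straight to file_get_contents or simplexml_load_file in either of the two methods.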

Enjoy 😉