Pull data from the open internet with Diffbot and PHP

Diffbot is an amazing search engine that catalogues the public web and groups the information using machine learning. They have an open API providing that information. This is how we used the Diffbot API in a Laravel application to pull information about tech companies.

I recently got in touch with the DIFFBOT team and obtained an API key to use their service to pull data from the open internet. In this tutorial we'll go through how to build a database seeder with Laravel to pull data from an API and save it in Postgres. The same principles and similar approach should be helpful to people coming from other PHP frameworks like Symfony, Laravel developers or anyone using another structured database like MySQL or MariaDB. If you come from other languages like Python, Ruby or something random like Clojurescript I hope you find value in it too to build amazing service that take advantage of a loosely structured collection of openly accessible internet data. Hooray!

We're going to specifically focus on pulling data about tech companies in this blog post but a similar approach could be used for pulling information about web articles, products or something custom you write for yourself. Imagine you have Google's technology what would you build? Normally I'm pretty skeptical of Artificial Intelligence (AI) and Machine Learning (ML) claims but what they've built to allow regular developers access to the open internet is pretty cool. I think they've got some cred when they write Diffbot allows you to Search or extract anything on the web. Diffbot uses machine learning to transform the internet into accessible, structured data.

Create a Laravel database seeder

The first step is a little bit of config for our Laravel project. We're going to take data from the Diffbot API and store it in a database. For that Laravel has database seeding functionality that allows us to write scripts known as "seeders" to populate data in our SQL database. Seeders can be used to fill your database with fake data using PHP's faker library. In our case though we'll make an API request and store the result in an existing table. We'll ship this to production and be good to go!

The first step is to create a seeder in an existing Laravel project:

php artisan make:seeder DiffbotCompaniesSeeder

Set up a database table

Postgres, Postgresql, PSQL, Amazon Redshift whatever you want to call it. We have Postgres running with Laravel. Laravel makes it easy to work with the following database backends:

  • MySQL 5.6+

  • PostgreSQL 9.4+

  • SQLite 3.8.8+

  • SQL Server 2017+

Using Postgres with Laravel is great. I'm still learning Postgres. One feature of Laravel's Eloquent ORM is JSON casting that makes fluidity between JSON objects and PHP arrays.

In our case we have a table with a JSONB Postgres column type in the following PHP database migration:

Schema::create('diffbot_companies', function (Blueprint $table) {
    $table->bigIncrements('id');
    $table->integer('company_id')->unique();
    $table->string('diffbot_id')->unique();
    $table->jsonb('diffbot_response')->nullable();
    $table->timestamps();
});

If you're wondering the difference between JSON and JSONB you're not alone. JSONB is slightly slower to input but much faster to read. We're only going to be inputting the data every so often via seeder. We want fast reads!

Make the API call and save the response

Now that we've got a table to store our data in we can go fetch it. We already have a list of tech companies I've been building up in spreadsheet for years. If you're interested in that data you can view it on GitHub it's open source. We have these organizations stored in a companies table. With our seeder we're going to save one response for each of our companies. We'll ensure uniqueness through the diffbot_id column. Diffbot has an Enhance API where you can fetch data about organizations. Each organization in their system has a unique id. This is how we query for our companies and then make an API call using the defacto standard Guzzle PHP library. We're going to use the company name and company url that we have saved in our companies table to make the request.

<?php

use App\Company;
use App\DiffbotCompanies;
use Illuminate\Database\Seeder;
use GuzzleHttp\Client;

class DiffbotCompaniesSeeder extends Seeder
{
    /**
     * Run the database seeds.
     *
     * @return void
     */
    public function run()
    {
        Company::chunk(200, function($companies){
            $client = new Client([
                'base_uri' => config('diffbot.base_uri'),
                'timeout'  => config('diffbot.guzzle_timeout')
            ]);
            $companies->each(function($company) use ($client) {
                $response = $this->getDiffbotCompany($client, $company);

                if(empty($response) || $response->getStatusCode() !== 200) {
                    $errorMessage = sprintf('Error grabbing organization info for %s \n', $company->name);
                    echo $errorMessage;
                    \Log::error($errorMessage);
                    return true;
                }

                $organizationResponse = json_decode($response->getBody());

                if (empty($organizationResponse) || empty($organizationResponse->data)) {
                    echo "[Error] Unable to get info for " . $company->name . "\n";
                    return true;
                }

                info('Getting data for ' . $company->name);

                $diffbotId = $organizationResponse->data[0]->entity->id;
                $diffbotResponse = $organizationResponse->data[0]->entity;

                if (empty($diffbotId) || empty($diffbotResponse)) {
                    echo '[ERROR] Parsing response for ' . $company->name . "\n";
                    return true;
                }

                echo 'Saving response for ' . $company->name . "\n";

                try {
                    // Save diffbot response for company
                    $diffbotCompany = new DiffbotCompanies();
                    $diffbotCompany->company_id = $company->id;
                    $diffbotCompany->diffbot_id = $diffbotId;
                    $diffbotCompany->diffbot_response = json_decode( json_encode($diffbotResponse), true);
                    $diffbotCompany->save();
                } catch (\Exception $e) {
                    echo $e->getMessage();
                    info($e->getMessage());
                    return true;
                }
            });
        });
    }

    /**
     * Make API call to get info about the Organization from Diffbot
     * @param Client $client
     * @param Company $company
     * @return \Psr\Http\Message\ResponseInterface|null
     * @throws \GuzzleHttp\Exception\GuzzleException
     */
    protected function getDiffbotCompany(Client $client, Company $company)
    {
        try {
            $queryParams = [
                'type' => 'Organization',
                'token' => config('diffbot.api_token'),
                'useCache' => true,
                'name' => $company->name,
                'url' => $company->website
            ];

            return $client->request('GET', 'kg/v3/enhance_endpoint', [
                'query' => $queryParams
            ]);
        } catch (\Exception $e) {
            echo $e->getMessage();
            info($e->getMessage());
            return null;
        }
    }
}