Sometimes you run into a situation where you need to identify whether a site links to you. For the purposes of this exercise we will work with the homepage only; the basic principles are the same for crawling an entire site (just repeat the process for each URL), which we may cover in a later article. Ok, let's get started.

Remotely Scrape HTML File

The first step is to remotely grab the HTML source of a website. There are two main ways to achieve this: cURL and file_get_contents().

file_get_contents()

The simplest and most widely used method of obtaining the contents of a file is file_get_contents(). While this works great locally, to remotely fetch a file's contents you will need to check that allow_url_fopen is enabled in php.ini (which may present security risks). To implement it you would use a line like the following:
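A minimal sketch (the URL is a placeholder; substitute the site you want to check, and note the helper name is our own):

```php
<?php
// Fetch a page's contents with file_get_contents().
// Fetching a remote URL requires allow_url_fopen = On in php.ini.
function fetch_html($url)
{
    $html = @file_get_contents($url); // @ suppresses the warning on failure
    return $html === false ? null : $html;
}

// Usage (placeholder URL):
// $html = fetch_html('https://example.com/');
```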

For the purpose of this example we will instead be using cURL.

cURL

You can learn specifics at php.net’s cURL Manual.
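A sketch of the cURL approach (the function name and options shown are just one reasonable setup; adjust to suit):

```php
<?php
// Fetch a page's HTML with cURL.
function curl_get_html($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Usage (placeholder URL):
// $html = curl_get_html('https://example.com/');
```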

This will return the site’s HTML.  To prevent the browser from rendering it so you can read it more easily, consider switching the content type to text/plain by adding the following line at the top of your document:
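Something like:

```php
<?php
// Serve the response as plain text so the fetched HTML is displayed
// verbatim rather than rendered. Must run before any output is sent.
header('Content-Type: text/plain');
```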

Is the site linking to you?

Now that we have the HTML, we can search it to identify whether a URL is present.  This can be done fairly easily with regular expressions.  We will use a few functions for this: preg_quote() (to escape special characters), some light regex, and preg_match_all() (to check whether the link is in the HTML we returned).
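A sketch of the check ($html is stubbed with sample markup here; in practice it would come from the cURL call above, and the URLs are placeholders):

```php
<?php
// Sample HTML standing in for the page fetched with cURL.
$html = '<p>Read our <a href="http://www.example.com/">partner site</a>.</p>';
$link = 'http://www.example.com/';

// Escape regex metacharacters in the URL, then look for any line containing it.
// /m makes ^ and $ match at line boundaries.
$pattern = '/^.*' . preg_quote($link, '/') . '.*$/m';

if (preg_match_all($pattern, $html, $matches)) {
    echo "Link found!\n";
} else {
    echo "Link not found.\n";
}
```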


You should now see a message indicating whether the link was found.  A few questions probably come to mind for the inquisitive:

1. What is the deal with the \n in the PHP?

Good question. The \n escape inserts a newline character, which renders as a line break when the content type is text/plain.

2. In the preg_match_all() call there is a $matches argument; what is that for?

If we want to see the line the URL shows up on, we can display the $matches array with var_dump($matches); this returns a dump of the array (if a match was found).  You can then output the matching line using:
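For instance (again with stubbed sample HTML; because of the /m modifier, each entry in $matches[0] is a full line that contained the link):

```php
<?php
// Multi-line sample HTML standing in for the fetched page.
$html = "<p>Intro</p>\n<p>See <a href=\"http://www.example.com/\">us</a></p>\n<p>End</p>";
$pattern = '/^.*' . preg_quote('http://www.example.com/', '/') . '.*$/m';

if (preg_match_all($pattern, $html, $matches)) {
    var_dump($matches);         // full dump of every match
    echo $matches[0][0] . "\n"; // just the first matching line
}
```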

3. What if the URL is not valid?

To determine whether a URL is valid leads you on a quest for the perfect regex, and that quest points to @dperini's version:
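@dperini's full pattern is too long to reproduce here; as a lighter-weight sketch, PHP's built-in filter_var() with FILTER_VALIDATE_URL covers the common cases (note it is more permissive than @dperini's regex, and the helper name is our own):

```php
<?php
// Simpler alternative: validate a URL with PHP's built-in filter.
function is_valid_url($url)
{
    return filter_var($url, FILTER_VALIDATE_URL) !== false;
}

var_dump(is_valid_url('http://www.example.com/')); // bool(true)
var_dump(is_valid_url('not a url'));               // bool(false)
```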

Equipped with this code we can proceed with validation.

4. What if the URL is uppercase or lowercase, or does not use a www?

While the regex above would be the most comprehensive, for this exercise we will just use the following:
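A looser pattern, assuming the domain being checked is the placeholder example.com: the /i modifier makes the match case-insensitive, and the "www." is optional.

```php
<?php
// Sample HTML with mixed case and no lowercase "www".
$html = '<a href="HTTP://Example.com/">link</a>';

// Case-insensitive, www-optional pattern (example.com is a placeholder).
$pattern = '/^.*https?:\/\/(www\.)?example\.com\/?.*$/mi';

if (preg_match_all($pattern, $html, $matches)) {
    echo "Link found!\n";
}
```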

Try it out

Give it a test run:
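Putting it all together, a complete script might look like this sketch ($html is stubbed so the example is self-contained; in practice it would come from the cURL fetch shown in the comment, and all URLs and the helper name are placeholders):

```php
<?php
header('Content-Type: text/plain');

// Return the matching lines if $html contains $link, or false if not.
function site_links_to($html, $link)
{
    $pattern = '/^.*' . preg_quote($link, '/') . '.*$/mi';
    return preg_match_all($pattern, $html, $matches)
        ? $matches[0]   // the lines containing the link
        : false;
}

// Normally $html comes from cURL, e.g.:
//   $ch = curl_init('https://example.com/');
//   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//   $html = curl_exec($ch);
//   curl_close($ch);
// Stubbed here for illustration:
$html = "<html>\n<a href=\"http://www.example.com/\">a link</a>\n</html>";

$lines = site_links_to($html, 'http://www.example.com/');
echo $lines ? "Link found:\n" . implode("\n", $lines) . "\n" : "Link not found.\n";
```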