Thursday, May 31, 2012

How do I prevent site scraping?


I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy artist names here and there and then do Google searches for them).



How can I prevent screen scraping? Is it even possible?


Source: Tips4all

19 comments:

  1. I will presume that you have set up robots.txt.

    As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.

    What I would consider doing is:


    Set up a page /jail.html
    Disallow access to the page in robots.txt (so the respectful spiders will never visit)
    Place a link on one of your pages, hiding it with CSS (display: none).
    Record IPs of visitors to /jail.html


    This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.

    You might also want to make your /jail.html an entire site with the exact same markup as your normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
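
    A minimal sketch of the honeypot idea, assuming a Flask app (the route, log message, and in-memory ban list are placeholders; a real setup would persist the flagged IPs):

    # Honeypot sketch (assumes Flask). /jail.html is disallowed in robots.txt
    # and linked only via a CSS-hidden anchor, so polite crawlers and humans
    # should never request it.
    from flask import Flask, request

    app = Flask(__name__)
    BANNED_IPS = set()   # in-memory for the sketch; persist this in reality

    @app.route("/jail.html")
    def jail():
        ip = request.remote_addr
        BANNED_IPS.add(ip)
        app.logger.warning("Honeypot hit from %s (UA: %s)",
                           ip, request.headers.get("User-Agent", "-"))
        return "Nothing to see here."

    @app.before_request
    def block_banned():
        if request.remote_addr in BANNED_IPS:
            return "Forbidden", 403   # short-circuits the request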

  2. There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IPs, etc., and appear to be normal users. The only thing you can do is make the text unavailable at the time the page is loaded: render it as an image or Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.

    If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
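
    For instance, a crude in-memory sliding-window limiter (a sketch only; the threshold is arbitrary, and a real deployment would keep state somewhere shared, such as Redis):

    import time
    from collections import defaultdict, deque

    WINDOW = 60          # seconds
    MAX_REQUESTS = 120   # arbitrary threshold: requests per IP per window

    hits = defaultdict(deque)

    def allow(ip):
        """Return True if this IP is still under the rate limit."""
        now = time.time()
        q = hits[ip]
        while q and q[0] < now - WINDOW:   # drop hits that left the window
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            return False                   # over the limit: deny or throttle
        q.append(now)
        return True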

    There is some hope, though. Scrapers rely on your site's data being in a consistent format; if you could somehow randomize it, it could break their scrapers. Things like changing the IDs or class names of page elements on each load, etc. That is a lot of work, though, and I'm not sure it's worth it. Even then, they could probably get around it with enough dedication.

  3. Sue 'em.

    Seriously: if you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You might really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease and desist or its equivalent in your country. You may be able to at least scare the bastards.

    Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.

    It would be a shame if this drove you into messing up your HTML code, dragging down SEO, validity, and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that rely on HTML structures and class/ID names to extract the content).

    Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money is something you should be able to fight against.

  4. Sorry, it's really quite hard to do this...

    I would suggest that you politely ask them not to use your content (if your content is copyrighted).

    If it is and they don't take it down, then you can take further action and send them a cease-and-desist letter.

    Generally, whatever you do to prevent scraping will probably end up having a more negative effect, e.g. on accessibility, bots/spiders, etc.

  5. Okay, as all the posts say, if you want to make it search-engine friendly, then bots can scrape it for sure.

    But there are a few things you can still do, and they may be effective against 60-70% of scraping bots.

    Make a checker script like the one below.

    If a particular IP is visiting very fast, then after a few visits (5-10) put its IP + browser info in a file or DB.

    Next step.
    (This would be a background process, running all the time or scheduled every few minutes.)
    Make another script that keeps checking those suspicious IPs.

    Case 1: if the user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more info on user agents by googling), then check it against the list at http://www.iplists.com/ and try to match patterns. If it seems to be a faked user agent, ask them to fill out a CAPTCHA on the next visit. (You need to research bot IPs a bit more; I know this is achievable, and a whois on the IP can also help.)

    Case 2: no search-bot user agent? Simply ask them to fill out a CAPTCHA on the next visit.

    Hope the above helps.
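
    A rough sketch of the two scripts described above (the thresholds are arbitrary, and ip_matches_known_bot_ranges / require_captcha are placeholder callables you would have to supply):

    import time
    from collections import defaultdict

    hits = defaultdict(list)   # in-memory for the sketch; use a file/DB in reality
    SUSPECTS = {}              # ip -> user-agent string

    def record_hit(ip, user_agent, limit=10, window=5.0):
        """Step 1, in the request path: flag IPs making `limit` hits in `window` seconds."""
        now = time.time()
        hits[ip] = [t for t in hits[ip] if t > now - window] + [now]
        if len(hits[ip]) >= limit:
            SUSPECTS[ip] = user_agent

    def looks_like_search_bot(user_agent):
        return any(b in user_agent.lower() for b in ("googlebot", "bingbot", "slurp"))

    def review_suspects(ip_matches_known_bot_ranges, require_captcha):
        """Step 2, background job. Both arguments are placeholder callables:
        one checks an IP against published bot lists (e.g. iplists.com) or
        whois, the other flags the IP so its next visit gets a CAPTCHA."""
        for ip, ua in list(SUSPECTS.items()):
            if looks_like_search_bot(ua):
                if not ip_matches_known_bot_ranges(ip):   # Case 1: faked bot UA
                    require_captcha(ip)
            else:
                require_captcha(ip)                       # Case 2: no bot UA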

  6. Provide an XML API to access your data, in a manner that is simple to use. If people want your data, they'll get it; you might as well go all out.

    This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.

    Then all you have to do is convince the people who want your data to use the API. ;)
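
    A minimal sketch of such an endpoint, assuming Flask plus the standard library's ElementTree (the URL scheme, fields, and sample data are all invented):

    from xml.etree.ElementTree import Element, SubElement, tostring
    from flask import Flask, Response

    app = Flask(__name__)

    # Invented sample data; in reality this would come from your database.
    ARTISTS = {"1": {"name": "Example Artist", "albums": ["First", "Second"]}}

    @app.route("/api/artist/<artist_id>")
    def artist(artist_id):
        data = ARTISTS.get(artist_id)
        if data is None:
            return Response("<error>not found</error>",
                            status=404, mimetype="application/xml")
        root = Element("artist", id=artist_id)
        SubElement(root, "name").text = data["name"]
        albums = SubElement(root, "albums")
        for title in data["albums"]:
            SubElement(albums, "album").text = title
        return Response(tostring(root), mimetype="application/xml")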

  7. Your best option is unfortunately fairly manual: look for traffic patterns that you believe are indicative of scraping and ban their IPs.

    Since you're talking about a public site, making it search-engine friendly will also make it scraping-friendly: if a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.

  8. Sure it's possible. For 100% success, take your site offline.

    In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).

    You can do things like requiring several seconds between the first connection to your site and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
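
    One possible interpretation, as a sketch: store a timestamp in a signed session cookie and reject clicks that arrive too quickly (assumes Flask; the two-second threshold is arbitrary):

    import time
    from flask import Flask, abort, session

    app = Flask(__name__)
    app.secret_key = "change-me"   # placeholder; needed to sign the session cookie
    MIN_DELAY = 2.0                # arbitrary: seconds required between page loads

    @app.before_request
    def enforce_minimum_delay():
        now = time.time()
        last = session.get("last_hit")
        session["last_hit"] = now
        if last is not None and now - last < MIN_DELAY:
            abort(429)   # Too Many Requests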

    I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.

  9. There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.

    However, I assume that if you don't want it scraped, you don't want search engines to index it either.

    Here are some things you can try:


    Show the text in an image. This is quite reliable and is less of a pain for the user than a CAPTCHA, but it means they won't be able to cut and paste, and the text won't scale prettily or be accessible.
    Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
    Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen-scraper might set up an account and might cleverly program their script to log in for them.
    If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
    You can set up a blacklist of known screen-scraper user-agent strings as you discover them. Again, this will only help against the lazily-coded ones; a programmer who knows what they're doing can set a user-agent string to impersonate a web browser. (A sketch of this check and the previous one follows this list.)
    Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
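
    A sketch of those two user-agent checks (the blacklist entries are just examples):

    UA_BLACKLIST = ("wget", "curl", "python-requests", "scrapy")   # examples only

    def user_agent_allowed(user_agent):
        """Reject empty UAs and known scraper UAs; real browsers always send one."""
        if not user_agent:          # the empty-user-agent check
            return False
        ua = user_agent.lower()
        return not any(bad in ua for bad in UA_BLACKLIST)   # the blacklist check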


    If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way, and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them, I guess.

  10. I work full time doing web scraping and have shared some of my techniques to stop web scrapers, based on what I find annoying.

    It is a tradeoff between your users and the scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.

  11. I agree with most of the posts above, and I'd like to add that the more search-engine friendly your site is, the more scrapeable it will be. You could try to do a couple of out-there things that make it harder for scrapers, but they might also affect your search-ability... it depends on how well you want your site to rank on search engines, of course.

  12. No, it's not possible to stop it (in any way).
    Embrace it. Why not publish as RDFa, become super search-engine friendly, and encourage the reuse of data? People will thank you and provide credit where due (see MusicBrainz as an example).


    Probably not the answer you want, but why hide what you're trying to make public?

  13. You can't stop normal screen scraping. For better or worse, it's the nature of the web.

    You can make it so that no one can access certain things (including music files) unless they're logged in as a registered user. This is not too difficult to do in Apache, and I assume it wouldn't be too difficult in IIS either.

  14. Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
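
    Verifying that a claimed crawler really is one is commonly done with a reverse-DNS lookup plus a forward confirmation (this is, for instance, how Google suggests verifying Googlebot). A sketch, with example domain suffixes:

    import socket

    # Claimed crawler name -> DNS suffixes its hosts resolve to (examples).
    CRAWLER_DOMAINS = {
        "googlebot": (".googlebot.com", ".google.com"),
        "bingbot": (".search.msn.com",),
    }

    def is_whitelisted_crawler(ip, user_agent):
        """Reverse-DNS the IP, check the domain, then forward-confirm it."""
        ua = user_agent.lower()
        for bot, suffixes in CRAWLER_DOMAINS.items():
            if bot in ua:
                try:
                    host = socket.gethostbyaddr(ip)[0]
                    return (host.endswith(suffixes)
                            and ip in socket.gethostbyname_ex(host)[2])
                except (socket.herror, socket.gaierror):
                    return False
        return False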

    Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.

  15. From a tech perspective:
    Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.

    From a legal perspective:
    It sounds like the data you're publishing is not proprietary; you're publishing names, stats, and other information that cannot be copyrighted.

    If this is the case, the scrapers are not violating copyright by redistributing your information about artist names, etc. However, they may be violating copyright when they load your site into memory, because your site contains elements that are copyrightable (like the layout, etc.).

    I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far-reaching and imaginative. Sometimes the courts buy the arguments; sometimes they don't.

    But, assuming you're publishing public-domain information that isn't copyrightable, like names and basic stats... you should just let it go in the name of free speech and open data. That's what the web is all about.

  16. Putting your content behind a CAPTCHA would mean that robots would find it difficult to access your content. However, humans would be inconvenienced, so that may be undesirable.

  17. Screen scrapers work by processing HTML, and if they are determined to get your data, there is not much you can do technically, because the human eyeball can process anything. Legally, as has already been pointed out, you may have some recourse, and that would be my recommendation.

    However, you can hide the critical parts of your data by using non-HTML presentation logic:


    Generate a Flash file for each artist/album, etc.
    Generate an image for each piece of artist content. Maybe just an image of the artist name would be enough: render the text onto a JPG/PNG on the server and link to that image (a sketch follows this list).
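
    A sketch of the image approach, assuming a recent version of the Pillow library (the font path and sizes are placeholders):

    from PIL import Image, ImageDraw, ImageFont

    def render_text_image(text, path, font_path="DejaVuSans.ttf", size=24):
        font = ImageFont.truetype(font_path, size)       # font path is a placeholder
        left, top, right, bottom = font.getbbox(text)    # measure the text
        img = Image.new("RGB", (right + 20, bottom + 20), "white")
        ImageDraw.Draw(img).text((10, 10), text, fill="black", font=font)
        img.save(path)

    render_text_image("Example Artist", "artist_name.png")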


    Bear in mind that this would probably affect your search rankings.

  18. Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.
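
    One way to sketch this: regenerate class names on every request so scrapers cannot key on them (the template is illustrative, and the matching CSS would have to be generated with the same names):

    import secrets
    from string import Template

    # Illustrative page fragment; a real site would template the whole page.
    PAGE = Template('<div class="$artist_cls">$artist</div>\n'
                    '<div class="$track_cls">$track</div>')

    def render_page(artist, track):
        return PAGE.substitute(
            artist_cls="c" + secrets.token_hex(4),   # new class name every request
            track_cls="c" + secrets.token_hex(4),
            artist=artist,
            track=track,
        )

    print(render_page("Example Artist", "Example Track"))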

  19. If you want to see a great example, check out http://www.bkstr.com/. They use a JavaScript algorithm to set a cookie, then reload the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get past this, but it would stop most cURL-type scraping.
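
    That flow can be sketched roughly as follows, assuming Flask (the cookie name and constant value are illustrative; a real version would derive a value the server can verify, as bkstr.com's algorithm presumably does):

    from flask import Flask, request

    app = Flask(__name__)

    CHALLENGE = """<html><body><script>
      // Set a cookie via JavaScript, then reload; plain cURL never executes this.
      document.cookie = "jschk=1; path=/";
      location.reload();
    </script></body></html>"""

    @app.route("/")
    def index():
        if request.cookies.get("jschk") != "1":
            return CHALLENGE            # no cookie yet: send the JS challenge
        return "Real page content"      # cookie present: serve the page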
