Wget request from Linux-PHP [expert] ?

 
Pseudonyme
(Msg. 1) Posted: Wed Jun 17, 2009 6:19 am
Archived from groups: comp.os.linux.development.apps, others

Hi,

I work in an administrative department where I have to retrieve
information about 1,500 companies from a state website every day.
Doing the job manually takes a couple of hours, so I am trying to
automate the process to save effort.

I am trying to wget the data from a Linux/PHP server. Do you know how
to retrieve information with wget when the request is complicated?

E.g.: the KOUGLOFF company.

The state website is: http://www.infogreffe.fr/infogreffe/reset.do
The SIREN entry is: 448676973
Doing the lookup manually returns the information about KOUGLOFF.

Once I have a session ID, I can copy and paste this link to retrieve
the info:
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973


I have written a Unix application that is ready to retrieve my 1,500
companies every day using repeated wget calls.

The problem is that, because I do not have a session ID, my server
cannot wget the file. I tried adding a session ID to the URL as an
extra parameter, but that does not lead me to the content:
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=4...76973&b


From your knowledge and experience: do you know how I could retrieve
the information?
Adding a session ID to the URL did not work from a Linux script.

Thank you very much for any help, solution or advice.

Norman.

Pseudonyme
(Msg. 2) Posted: Wed Jun 17, 2009 9:25 am

Hi,
Thanks. So I have been carefully reading about the curl command
instead of the wget command.

curl can be invoked at least from the Linux command line and from PHP.
My first impression is that the Linux commands work much better. The
website listed above does not seem to use cookies.
Access to the KOUGLOFF company details depends on session recognition.
The principle seems to be: no session, no access to the company
details.

The problem is that session handling with curl under Linux is not easy
to work out from the documentation:
http://linux.about.com/od/commands/l/blcmdl1_curl.htm
There is information about cookies but nothing about sessions.

curl_init(), curl_setopt(), curl_exec(), curl_close() seem to be
available only in PHP.

Thank you very much for any help, working solutions or advice.

Norman.

Erwin Moller
(Msg. 3) Posted: Wed Jun 17, 2009 11:20 am

Pseudonyme wrote:
> Hi,
>

Hello Norman,

> I work in an administrative department where I have to retrieve
> information about 1,500 companies from a state website every day.
> Doing the job manually takes a couple of hours, so I am trying to
> automate the process to save effort.

Makes sense.

>
> I am trying to wget the data from a Linux/PHP server. Do you know how
> to retrieve information with wget when the request is complicated?

Why use wget if you are on PHP?
Why not simply use file() or file_get_contents() and feed it a URL?
http://nl3.php.net/manual/en/function.file.php


>
> E.g.: the KOUGLOFF company.
>
> The state website is: http://www.infogreffe.fr/infogreffe/reset.do
> The SIREN entry is: 448676973
> Doing the lookup manually returns the information about KOUGLOFF.
>
> Once I have a session ID, I can copy and paste this link to retrieve
> the info:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
>
>
> I have written a Unix application that is ready to retrieve my 1,500
> companies every day using repeated wget calls.
>
> The problem is that, because I do not have a session ID, my server
> cannot wget the file. I tried adding a session ID to the URL as an
> extra parameter, but that does not lead me to the content:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=4...76973&b
>

OK, that makes sense.
You must first log in, so you have an active session.
After that you can get the info.

A few remarks:
1) Judging by the length of the session ID, this is not a standard
PHP-generated session ID; those are shorter.
2) Maybe they only accept the session ID via a cookie instead of GET.

If I were you I would start by figuring out how the session ID is
transferred to you. Is it in a cookie? In a form? (Apparently it is not
meant to be in the URL.)

Maybe consider using CURL instead of WGET or file() as I suggested above.
http://nl3.php.net/manual/en/book.curl.php

Using cURL, you can add cookies that contain the session ID, and also
mimic POSTs reliably.
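
As a rough illustration of that last point, here is a minimal shell sketch that sends a session ID in a cookie together with POST data using the curl command line. The function name post_with_session and the cookie name in the usage note are assumptions for illustration only; the real cookie name has to be read from the site's own response headers.

```shell
#!/bin/sh
# Sketch: send a session cookie plus POST data with the curl command line.
# post_with_session is a hypothetical helper name; the cookie name and
# value are supplied by the caller and must match what the server issued.

# post_with_session COOKIE_NAME COOKIE_VALUE URL POST_DATA
post_with_session() {
    cookie_name=$1
    cookie_value=$2
    url=$3
    post_data=$4

    # -s  silent mode (no progress meter)
    # -b  send the given "name=value" pair in a Cookie: request header
    # -d  send the data as an application/x-www-form-urlencoded POST body
    curl -s -b "${cookie_name}=${cookie_value}" -d "$post_data" "$url"
}
```

A call might look like `post_with_session JSESSIONID "value-copied-from-the-browser" "http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml" "siren=448676973"`, where JSESSIONID is only a guess at the cookie name.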

Regards,
Erwin Moller



--
"There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult."
-- C.A.R. Hoare

Nick Birnie
(Msg. 4) Posted: Wed Jun 17, 2009 11:20 am

On 06/17/2009 02:19 PM, Pseudonyme wrote:
> Hi,
>
> I work in an administrative department where I have to retrieve
> information about 1,500 companies from a state website every day.
> Doing the job manually takes a couple of hours, so I am trying to
> automate the process to save effort.
>
> I am trying to wget the data from a Linux/PHP server. Do you know how
> to retrieve information with wget when the request is complicated?
>
> E.g.: the KOUGLOFF company.
>
> The state website is: http://www.infogreffe.fr/infogreffe/reset.do
> The SIREN entry is: 448676973
> Doing the lookup manually returns the information about KOUGLOFF.
>
> Once I have a session ID, I can copy and paste this link to retrieve
> the info:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
>

Looking at the HTTP headers (http://pastebin.com/f8ebfee), retrieving
the above page involves 2 redirects and 1 cookie.
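
One way to reproduce this kind of header check from the command line is sketched below. The command `curl -s -L -D - -o /dev/null URL` follows redirects and prints only the response headers. The helper names dump_headers and summarize_headers are assumptions for illustration, and the counts depend entirely on what the server returns at the time.

```shell
#!/bin/sh
# Sketch: count redirects and cookies in a dump of HTTP response headers.
# dump_headers and summarize_headers are hypothetical helper names.

# summarize_headers: reads header text on stdin and prints two counters.
summarize_headers() {
    # Status lines look like "HTTP/1.1 302 Found"; a 3xx code is a redirect.
    # Header names are case-insensitive, so Set-Cookie is matched in lowercase.
    awk '
        /^HTTP\/[0-9.]+ 3[0-9][0-9]/   { redirects++ }
        tolower($0) ~ /^set-cookie:/   { cookies++ }
        END { printf "redirects=%d cookies=%d\n", redirects, cookies }
    '
}

# dump_headers URL: follow redirects (-L), write the headers of every
# response to stdout (-D -), and discard the bodies (-o /dev/null).
dump_headers() {
    curl -s -L -D - -o /dev/null "$1"
}
```

For example, `dump_headers 'http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973' | summarize_headers` prints one line with both counters.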

>
> I have written a Unix application that is ready to retrieve my 1,500
> companies every day using repeated wget calls.
>
> The problem is that, because I do not have a session ID, my server
> cannot wget the file. I tried adding a session ID to the URL as an
> extra parameter, but that does not lead me to the content:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=4...76973&b

If wget is giving you problems, then you can either try to fix them
or use another API. Something like libcurl will handle your network
I/O for you and provide an API to read and set the HTTP headers.

> From your knowledge and experience: do you know how I could retrieve
> the information? Adding a session ID to the URL did not work from a
> Linux script.

I would look into using libcurl based on what you've said and the
headers. It'd be a simple matter to parse the cookies and session ID to
use in subsequent requests.

>
> Thank you very much for any help, solution or advice.
>
> Norman.

Pseudonyme
(Msg. 5) Posted: Thu Jun 18, 2009 1:02 am

Hi,
To retrieve the content, I will not use wget, file() or
file_get_contents(). I believe curl from a Unix script is more
effective.
The point is that curl has to open a session to access the detailed
content.
Opening a session with the Unix curl command is not covered in the
documentation.
Do you know how to open a session using the Unix curl command?
Norman

Erwin Moller
(Msg. 6) Posted: Thu Jun 18, 2009 3:21 am

Pseudonyme wrote:
> Hi,
> Thanks. So I have been carefully reading about the curl command
> instead of the wget command.
>
> curl can be invoked at least from the Linux command line and from PHP.
> My first impression is that the Linux commands work much better. The
> website listed above does not seem to use cookies.
> Access to the KOUGLOFF company details depends on session recognition.
> The principle seems to be: no session, no access to the company
> details.
>
> The problem is that session handling with curl under Linux is not easy
> to work out from the documentation:
> http://linux.about.com/od/commands/l/blcmdl1_curl.htm
> There is information about cookies but nothing about sessions.
>
> curl_init(), curl_setopt(), curl_exec(), curl_close() seem to be
> available only in PHP.
>
> Thank you very much for any help, working solutions or advice.
>
> Norman.
>

Exactly which part of my previous answer did you not understand?

Erwin Moller


Danny Wilkerson
(Msg. 7) Posted: Thu Jun 18, 2009 5:44 am

Do you feel like you are beating your head against a wall? Hehe. This
group is PHP-based, so it's OK to assume we are talking about PHP.
Some people just don't understand what is going on and they expect you
to do their work for them. It's not hard. Your answer was perfect, and
if he/she does not get it, let them go somewhere else.

Jerry Stuckle
(Msg. 8) Posted: Thu Jun 18, 2009 6:48 am

Pseudonyme wrote:
> Hi,
> To retrieve the content, I will not use wget, file() or
> file_get_contents(). I believe curl from a Unix script is more
> effective.
> The point is that curl has to open a session to access the detailed
> content.
> Opening a session with the Unix curl command is not covered in the
> documentation.
> Do you know how to open a session using the Unix curl command?
> Norman

HTTP is a stateless protocol - there is no such thing as a session in
the protocol. The only "session" is that which is defined by the server
software. Therefore, there is no special API for establishing a session.

For this session to work, the server sends a session id to the client,
and the client responds with this session id on each request. The
session id may be stored on the client in a cookie (the most common
case), or it may be a parameter in the URI (generally when the client
does not support cookies). cURL can handle cookies just fine; if
instead it's part of the URI you'll have to parse the page containing
the link to get the session id.
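
The two cases can be sketched with the curl command line roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, and the ";jsessionid=" link format in the second helper is a guess at how a URI-based session ID might look, not something confirmed for this site.

```shell
#!/bin/sh
# Sketch of the two session mechanisms, using the curl command line.
# Function names and the ";jsessionid=" URL form are assumptions.

# Case 1: the session id travels in a cookie.
# fetch_with_cookie_jar ENTRY_URL DATA_URL
fetch_with_cookie_jar() {
    entry_url=$1
    data_url=$2
    jar=$(mktemp) || return 1

    # First request: -c saves any cookies the server sets into the jar file.
    # -L follows redirects; the body of the entry page is discarded.
    curl -s -L -c "$jar" -o /dev/null "$entry_url"

    # Second request: -b reads the cookies back from the jar and sends them,
    # so the server can associate this request with the same session.
    curl -s -L -b "$jar" "$data_url"
    rc=$?

    rm -f "$jar"
    return $rc
}

# Case 2: the session id is embedded in links on the page.
# extract_session_id reads HTML on stdin and prints the id taken from the
# first line that contains a link of the assumed form ";jsessionid=VALUE".
extract_session_id() {
    sed -n 's/.*;jsessionid=\([A-Za-z0-9._-]*\).*/\1/p' | head -n 1
}
```

With the URLs from this thread, the first case would be called as `fetch_with_cookie_jar 'http://www.infogreffe.fr/infogreffe/reset.do' 'http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973'`. Whether that is enough depends on what else the server checks.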

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex.RemoveThis@attglobal.net
==================

Pseudonyme
(Msg. 9) Posted: Thu Jun 18, 2009 8:59 am

Hi all, about the curl command:
1) It works fine and retrieves the content properly for open,
unprotected websites, for example:
> curl 'http://www.abca.com' | more


2) Properly retrieving the content for the KOUGLOF company is a major
problem.

From here: http://tinyurl.com/lyhsmh

SIREN: 453786980

We tried sending the SIREN as POST data:

> curl -F "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -F "siren=@453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -d "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980

No content at all can be retrieved, even though we carefully read
each of your messages three times, as well as all of the official Unix
curl documentation.

This one suffers from the same lack of transparency:
http://avis-situation-sirene.insee.fr/avisitu/IdentificationListeSiret...?bSubmi

The problem is that I have to present my initial findings to my
supervisor, and I do not see how to make progress and get the answer.
Thank you for any help or working solutions,
Norman

Jerry Stuckle
(Msg. 10) Posted: Thu Jun 18, 2009 10:25 am

Danny Wilkerson wrote:
> Do you feel like you are beating your head against a wall? Hehe. This
> group is PHP-based, so it's OK to assume we are talking about PHP.
> Some people just don't understand what is going on and they expect you
> to do their work for them. It's not hard. Your answer was perfect, and
> if he/she does not get it, let them go somewhere else.

No, I'll try to help those who are interested in learning. The OP
obviously is not familiar with how sessions work, which is quite common.
Most of the time it can be somewhat ignored because PHP does most of
the session handling behind the scenes. However, when you start trying
to do the actions the OP is talking about, it requires a little better
understanding about how sessions work.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex DeleteThis @attglobal.net
==================

Tauno Voipio
(Msg. 11) Posted: Thu Jun 18, 2009 1:20 pm

Pseudonyme wrote:
> Hi all, about the curl command:
> 1) It works fine and retrieves the content properly for open,
> unprotected websites, for example:
>> curl 'http://www.abca.com' | more
>
>
> 2) Properly retrieving the content for the KOUGLOF company is a major
> problem.
>
> From here: http://tinyurl.com/lyhsmh
>
> SIREN: 453786980
>
> We tried sending the SIREN as POST data:
>
>> curl -F "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
>> curl -F "siren=@453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
>> curl -d "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980
>
> No content at all can be retrieved, even though we carefully read
> each of your messages three times, as well as all of the official Unix
> curl documentation.
>
> This one suffers from the same lack of transparency:
> http://avis-situation-sirene.insee.fr/avisitu/IdentificationListeSiret...?bSubmi
>
> The problem is that I have to present my initial findings to my
> supervisor, and I do not see how to make progress and get the answer.
> Thank you for any help or working solutions,
> Norman


Obviously, you are trying to do something the website
designer has decided to prevent. He wants to stop
automated vacuuming of his database without harming
genuine manual queries.

One reason may be that your curl command does not
send the proper browser headers with the request.
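
To test that idea, a request with browser-like headers could be sketched as below. The helper name fetch_like_browser, the User-Agent string and the Accept-Language value are all example values chosen for illustration; nothing here is known to be what the site actually requires.

```shell
#!/bin/sh
# Sketch: send browser-like request headers with curl.
# fetch_like_browser is a hypothetical helper; the header values are
# examples only, not values the site is known to check.

# fetch_like_browser URL [REFERER]
fetch_like_browser() {
    url=$1
    referer=${2:-}

    # -A sets the User-Agent header, -H adds an arbitrary request header,
    # -e sets the Referer header (only passed when a referer is given).
    if [ -n "$referer" ]; then
        curl -s -L -A "Mozilla/5.0 (X11; Linux x86_64)" \
             -H "Accept-Language: fr,en;q=0.8" \
             -e "$referer" "$url"
    else
        curl -s -L -A "Mozilla/5.0 (X11; Linux x86_64)" \
             -H "Accept-Language: fr,en;q=0.8" "$url"
    fi
}
```

Comparing the output with and without these headers would show whether the server treats the two requests differently.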

--

Tauno Voipio
tauno voipio (at) iki fi

Jerry Stuckle
(Msg. 12) Posted: Thu Jun 18, 2009 1:20 pm

Pseudonyme wrote:
> Hi all, about the curl command:
> 1) It works fine and retrieves the content properly for open,
> unprotected websites, for example:
>> curl 'http://www.abca.com' | more
>
>
> 2) Properly retrieving the content for the KOUGLOF company is a major
> problem.
>
> From here: http://tinyurl.com/lyhsmh
>
> SIREN: 453786980
>
> We tried sending the SIREN as POST data:
>
>> curl -F "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
>> curl -F "siren=@453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
>> curl -d "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980
>
> No content at all can be retrieved, even though we carefully read
> each of your messages three times, as well as all of the official Unix
> curl documentation.
>
> This one suffers from the same lack of transparency:
> http://avis-situation-sirene.insee.fr/avisitu/IdentificationListeSiret...?bSubmi
>
> The problem is that I have to present my initial findings to my
> supervisor, and I do not see how to make progress and get the answer.
> Thank you for any help or working solutions,
> Norman
>
>

That's because the site is using JavaScript on the button clicks. You
will have to emulate the JavaScript to get it to work - and this is
likely to fail if they change the page and/or JavaScript.

I'm with Tauno on this one - the site is obviously designed to prevent
what you're trying to do. I would recommend you contact the company to
see if there is another way to retrieve the data.
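
If you do look into what the page submits, a first step might be to list the form actions and input names in a saved copy of the page. The sketch below does that with standard tools. The helper name list_form_fields is an assumption, and the patterns only handle lowercase tags with simple double-quoted attributes, so treat it as a rough aid rather than an HTML parser.

```shell
#!/bin/sh
# Sketch: list form actions and input field names from HTML on stdin.
# list_form_fields is a hypothetical helper; it only understands lowercase
# tags with double-quoted attributes and is not a real HTML parser.

list_form_fields() {
    # Join the input into one line, then start a new line before every "<"
    # so that each tag can be matched on its own line.
    tr '\n' ' ' | sed 's/</\
</g' |
    sed -n \
        -e 's/^<form[^>]* action="\([^"]*\)".*/action=\1/p' \
        -e 's/^<input[^>]* name="\([^"]*\)".*/input=\1/p'
}
```

For a page saved with `curl -s -o page.html URL`, running `list_form_fields < page.html` prints one `action=` line per form and one `input=` line per named input field. Anything the JavaScript adds at click time will not show up here.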

