I have been building and maintaining a data lake in AWS for the past year or so, and it has been a learning experience to say the least. Recently I had an issue where an AWS Glue crawler stopped updating a catalog table that represented raw syslog data being imported in.
The error being shown was:
INFO : Multiple tables are found under location [S3 bucket and path]. Table [table name] is skipped.
So the table stayed in the catalog, but new data (and the partitions I was using) was no longer being added to it. I was thoroughly confused about what was going on until I noticed that the crawler had started failing around a day when I knew other users had been active in the account.
When I looked in the S3 bucket, I found someone had left a different file there. I believe it was happenstance: they synced over a file set that included an additional log file at the base level and didn't notice it.
Anyhow, that was enough to confuse the crawler. I had the crawler set to treat everything in that path as a single table, so it didn't like finding a file that wasn't in the syslog format like everything else.
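For reference, the "treat everything in that path as a single table" behavior corresponds to the crawler's grouping configuration. If I remember the Glue crawler configuration JSON correctly, it looks roughly like this (a sketch, not copied from my actual crawler):

```json
{
  "Version": 1.0,
  "Grouping": {
    "TableGroupingPolicy": "CombineCompatibleSchemas"
  }
}
```

With this set, the crawler tries to merge everything under the include path into one table, which is exactly why a single incompatible file can trip it up.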
Solution: Delete that file. Run the crawler again.
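In CLI terms, that fix is just two commands. The bucket, key, and crawler name below are placeholders for illustration, not my real ones:

```shell
# Remove the oddball file that confused the crawler (placeholder key).
aws s3 rm s3://bucketname/path/oddball.log

# Kick off the crawler again so it rebuilds the table and partitions.
aws glue start-crawler --name my-syslog-crawler
```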
After thinking on it a bit, I remembered that the AWS CLI for S3 would have been a quick way to look for files. If you aren't familiar with the command line interface (CLI) for AWS, Google for help on it.
I used this S3 list command to recursively list the folder and piped the results to the sort command to get them in modified-date order. Then I scrolled through the list to around the time the failures started and found the oddball file that had been added.
aws s3 ls s3://bucketname/path/ --recursive | sort
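The reason a plain `sort` works here is that `aws s3 ls --recursive` prints each object as date, time, size, then key, so lexicographic order is chronological order. A quick local illustration with fabricated listing lines (the keys and sizes are made up):

```shell
# Fabricated lines in the format `aws s3 ls --recursive` emits:
# date, time, size, key.
printf '%s\n' \
  '2023-01-15 09:12:01 1024 path/syslog-2023-01-15.log' \
  '2023-03-02 11:05:44 2048 path/syslog-2023-03-02.log' \
  '2023-02-28 16:40:10 512 path/oddball.txt' \
  | sort | tail -n 1
# Because the timestamp leads each line, the last line after sorting
# is the most recently modified object.
```

Adding `| tail -n 20` to the real command is a handy way to see just the newest objects instead of scrolling through the whole listing.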